Chapter 1
Introduction 


1.1  Motivation 

The trend in robotics is to transition from the traditional work-cell model, in which
robots work separately from human workers in highly controlled environments, to
a model in which the workspace is semi-structured, and humans and robots work
within the same space to perform tasks. Drivers for this transition include the
ability to operate robots with less setup, and the ability to perform tasks that are
best achieved by leveraging the complementary skills of humans and robots.

To enable humans and robots to work together safely and effectively, while
completing tasks under stringent temporal and safety requirements, we require
robots to possess a keen sensitivity and responsiveness to the uncertainty in their
environment. Current practice for ensuring task correctness and safety often
requires groups of engineers to reason over a very large number of potential
decisions and scenarios that might unfold during execution. They then manually
generate fault-monitoring code and contingency procedures that account for the
most likely scenarios, a process that is challenging, time-consuming, and
error-prone.

The number of possible execution scenarios in unstructured environments is
often overwhelming. For that reason, robot operators often respond by choosing
conservative, highly predictable task execution strategies. This is particularly
true when robots are tasked with safety-critical missions, such as information
gathering in hostile environments (outer space, deep sea, war zones, etc.)
and operation in close proximity to humans. These conservative strategies tend to
be far from ideal in terms of accuracy and throughput, and are often brittle under
disturbances because the execution strategy cannot adapt to the environment. For
instance, a common “safe” execution strategy consists of a precomputed sequence
of actions that is robust to the worst-case scenario. Such a strategy typically
requires humans and robots to work in complete separation, so as to limit the
impact of uncertain human behavior on robot actions.

The Artificial Intelligence community has been developing robotic task
execution strategies that improve robustness using on-line executives. These
executives choose actions based on the state of the world, and adapt to temporal delays
by performing dynamic scheduling. System managers, however, may resist the
adoption of these more dynamic methods, due to a lack of explicit guarantees of
correctness in terms of risk of task failure. To address this concern, this project
focused on the development of formal tools and algorithms that allow non-experts
to easily specify desired safe behavior at a natural level of abstraction; verify the
correctness of execution strategies and schedules in the presence of uncertainty;
and automatically generate such safe execution strategies and schedules from a
model description.


1.2  Desiderata 

This  project  developed  formal  tools  and  algorithms  that  enable  robots  to  execute 
tasks  while  achieving  high  performance,  defined  by  some  measure  of  utility.  In 
addition  to  utility,  these  executives  focus  on  enabling  robots  to  achieve  these  tasks 
within  deadlines  specified  as  temporal  constraints,  and  enable  robots  to  perform 
these  tasks  while  ensuring  that  hard  safety  guarantees  are  met.  Achieving  these 
goals  presents  a  number  of  challenges: 

1. The robot should appropriately represent and reason about environment and
action  uncertainty,  and  should  use  these  models  to  predict  the  chances  of 
success  of  a  plan  of  action.  It  should  also  be  able  to  plan  actions  optimally, 
while  guaranteeing  a  probability  of  success  specified  by  the  user. 

2.  In  order  to  be  effective,  the  robot  must  be  able  to  react  to  disturbances 
quickly,  in  real  time,  so  that  compensating  actions  (if  any  exist)  are  not 
delayed. 

3.  In  order  to  manage  environmental  uncertainty,  a  robot  must  intelligently 
combine  sensing  actions  with  state-changing  actions,  so  that  the  state-changing 
actions can be accomplished with an adequate probability of success.

We address these challenges using a model-based executive with three key
capabilities. To address the first challenge (guaranteeing success probability), we
leverage our prior work on Probabilistic Temporal Plan Networks (pTPN) [1],
which extend Temporal Plan Networks with Uncertainty [2] by considering
probabilistic models for uncontrollable choices and allowing chance constraints to
be imposed on the violation of temporal constraints. We extend the
chance-constrained approach of [3] to high-level temporal activity planning with
uncertainty and sensing actions, and improve the robustness of our schedules by
modeling systems of temporal constraints as Probabilistic Simple Temporal
Networks [4,5], therefore allowing activity scheduling to explicitly take into account
the stochastic nature of the duration of various tasks.

To  address  the  second  challenge  (achieving  fast  robustness  to  disturbances), 
our  planner  generates  pTPNs  with  sufficient  choice  nodes  to  provide  flexibility 
to  anticipated  uncontrollable  events.  The  generated  pTPN  effectively  encodes  an 
optimal  control  policy,  in  the  form  of  a  conditional  plan,  for  the  correct  responses 
to  any  combination  of  uncontrollable  event  outcomes.  This  allows  the  runtime 
executive  to  interpret  the  plan  (the  pTPN)  to  quickly  make  optimal  decisions,  no 
matter  how  the  uncontrollable  events  turn  out. 

To address the third challenge (achieving adequate situation awareness), our
planner generates pTPNs that optimally combine sensing and state-changing
actions so that the overall situation awareness is adequate to achieve the goals, within
the success probability specified by the user. We leverage previously developed
planning capabilities, augmenting them with the capability of combining sensing
and state-changing actions.

The  goal  of  the  project  was  to  conduct  basic  research  to  create  a  prototype  of 
a  deployable  risk-sensitive  intelligent  system.  While  our  theoretical  contributions 
were  mission-enabling  by  allowing  reasoning  over  probabilistic  uncertainty,  we 
also  ensured  that  the  advances  were  translated  to  useful  technology.  We  present 
the  high-level  description  of  the  desired  characteristics  which  guided  our  efforts 
as  follows: 

1.  The  system  shall  be  risk-sensitive.  Given  the  environment  model  and  a 
description  of  the  actions  and  observations,  the  system  will  generate  a  plan 
and  schedule  to  meet  all  specifications  with  a  guarantee  on  the  probability 
of  success. 

2. The system shall be scalable. The system will solve for a plan and schedule
within a time frame appropriate to the number of variables in the input
mission description.

3. The system shall have contingency plans. The system will solve for a nominal
plan, but also have backup strategies in the case of particularly large
deviations from what was expected.

4.  The  system  shall  be  transferable  across  problem  domains.  The  intelligent 
system  shall  allow  both  operations  level  planning  and  support  direct  control 
of  hardware  assets  in  a  variety  of  problem  scenarios,  dealing  with  different 
vehicles  across  different  environments. 

5. The system shall be easy to operate by non-expert users. The intelligent
system should allow the user to specify a set of desired outcomes and reason
over automatically instantiated actions. Because the actions are no longer
hand-coded, this approach addresses the scalability issues that arise when describing
problems with a large number of actions, and does not require the user to
have prior knowledge of how each action affects the environment.

1.3  Contributions 

An overview of Enterprise, the model-based programming and execution
architecture implementing our risk-sensitive intelligent system, is given in Chapter 2.
We describe the flow of information and control, from high-level planning and
scheduling to low-level dispatch and execution layers, and provide evidence of
Enterprise’s ease-of-use by non-experts, mission-enabling features, and
transferability across domains by detailing how it has been deployed in support of two
real-world applications: an undergraduate course at MIT involving autonomous
quadcopters; and coordinating science-gathering missions for autonomous
underwater vehicles operated by the Woods Hole Oceanographic Institution (WHOI).
Subsequent chapters present the different modules and tools comprising
Enterprise.

In Chapter 3, we describe cRMPL, an extension of the model-based programming
language RMPL [6] that allows missions with state and temporal uncertainty,
in addition to risk bounds in the form of chance constraints, to be specified
at a high level of abstraction. In order to be practically useful, a model-based
programming language must allow easy and reusable modeling of components and
their hierarchical compositions into more complex systems; and enable programmers
to specify the desired behavior of autonomous systems in terms of desired
state, rather than low-level control commands. Our experiments in Chapter 7
indicate how easily cRMPL can be used as a modeling and control specification tool
by demonstrating it in the temporal coordination of science agents operating under
temporal uncertainty, and in specifying an on-line execution policy that allows a
robot to adapt to its human coworker in a collaborative manufacturing application.

Chapters 4 and 6 present an overview of formal verification tools that allow
conditional plans with sensing and temporal uncertainty to be checked against
safety specifications in the form of chance constraints. The algorithms in Chapter 4,
whose detailed description is given in [1], pursue a diagnostic approach to
quickly detect risky branches in a plan and verify whether the probability of failure
they incur is compatible with the given risk bounds. Picard, the scheduling
algorithm presented in Chapter 6 and thoroughly explained in [7], is capable of
handling probabilistic uncertainty in the timing of actions and computing activity
schedules that are guaranteed to meet their temporal deadlines with high
probability.

In situations where the manual specification of cRMPL programs is tedious,
or where the programs are too large to be written explicitly, one can use RAO*, a planning
algorithm capable of deriving contingent execution policies with guarantees on
the probability of success in domains with state uncertainty. It is briefly described
in Chapter 5, with a scalability analysis and more detailed explanation given in
the appendices.

In addition to the experimental results described in Chapter 2 and in
the scientific publications resulting from this project, we present in Chapter 7 a
suite  of  different  demonstrations  evaluating  Enterprise  against  the  desiderata  set 
out  above.  We  provide  evidence  that  our  methods  are  generally  applicable  and 
transferable  by  presenting  experiments  in  three  different  domains:  coordination 
of  Mars  rovers;  collaborative  human-robot  manufacturing;  and  unmanned  aerial 
scouts.  We  also  provide  insight  and  references  to  research  in  learning  probabilistic 
models  of  the  environment  in  support  of  model-based  reasoning  and  execution, 
such  as  our  algorithm  for  learning  Probabilistic  Hybrid  Automata  (PHA)  from 
experimental  data  [8].  Finally,  we  present  our  conclusions  in  Chapter  8. 


1.4  Publications  attributed  to  this  project 

For ease of reference, the following is a list of publications concerning work
developed under this project:

• Fang et al., “Chance-Constrained Probabilistic Simple Temporal Problems”
[7];

• Santana & Williams, “Chance-Constrained Consistency for Probabilistic
Temporal Plan Networks” [1];

• Santana et al., “Learning Hybrid Models with Guarded Transitions” [8];

• Santana & Williams, “Dynamic Execution of Temporal Plans with Sensing
Actions and Bounded Risk” [9];

• Santana et al., “RAO*: an Algorithm for Chance-Constrained POMDP’s”
(Appendix A).


Chapter 2

Risk-sensitive  model-based 
execution 


In this chapter we present an overview of Enterprise, the model-based programming
and execution architecture used by our risk-sensitive intelligent system.
We describe the flow of information and control, from high-level planning and
scheduling to low-level dispatch and execution layers, and provide evidence of
Enterprise's ease-of-use by non-experts, mission-enabling features, and
transferability across domains by detailing how it has been deployed in support of two
real-world applications: an undergraduate course at MIT involving autonomous
quadcopters; and coordinating science-gathering missions for autonomous
underwater vehicles operated by the Woods Hole Oceanographic Institution (WHOI).


2.1  Enterprise 

Modern robotic systems consist of a variety of components acting together. Broadly
speaking, there is the hardware layer consisting of the actuators and sensors, the
control system layer that drives the hardware to a desired configuration, the
planning and execution layer that generates action sequences, and the input layer,
which can be a human or another automated system that provides goals.
Enterprise is a framework for the planning and execution layer and is developed by
the MERS group.

As shown in Figure 2.1, Enterprise accepts as input the goals that should be
achieved, a model of the robot, and a model of the environment and, over time,
outputs commands that are executable by the control layer. Enterprise accepts
the model and goal description in the form of an RMPL (Reactive Model-based
Programming Language) program. Then, activity planners, path planners, and
schedulers are called as needed to create an executable plan in the form of a
temporal plan network (TPN). Finally, a dispatcher is given the executable TPN
and invokes the appropriate actions in the control layer.

Figure 2.1: Conceptual interface of Enterprise.


Enterprise is designed for a variety of planners to be plugged into it to provide
its described end-to-end capability. The current version of Enterprise allows
only for a manually specified, static configuration of planners. Communication
between planners is done through temporal plan networks, with each planner and
dispatcher accepting TPNs with a subset of the allowable features of a TPN (e.g.,
goals, controllable decisions, uncontrollable decisions, etc.). For this work,
Enterprise was configured to use pKirk, Picard, and Pike as shown in Figure 2.2.
pKirk takes a goal description and models in the form of cRMPL and generates
an executable TPN with choice. pKirk is described in depth in Chapter 5. Picard
is given a chance-constrained probabilistic simple temporal problem (cc-pSTP)
and determines if it is feasible. Picard is used by Kirk to check TPNs as they
are being constructed and is described in depth in Chapter 6. Pike is a dispatcher
and execution monitor that sends commands to the control layer to be executed
at the appropriate times and monitors the sensors to know which choices in the
executable TPN have been made. Pike is described in depth in [10].


Figure  2.2:  Configuration  of  pKirk,  Picard,  and  Pike  within  Enterprise. 

2.2  Enterprise  deployments 

The Enterprise system has been used in two real-world deployments by people
outside of the MERS group. While neither of these deployments used specifically
the two main algorithms developed in this project, pKirk or Picard, both detailed
later in this report, they serve to demonstrate the broad applicability and
transferability across domains offered by Enterprise.

2.2.1  Crash  course  in  autonomy 

During January 2015, the MERS group (including the authors)
taught a hands-on, introductory autonomy course aimed at MIT freshmen and
sophomores, titled “Crash Course in Autonomy.” The course was broken into five
modules covering different aspects of autonomous systems. In each module,
students were provided an introductory lecture, an advanced lecture, and a lab assignment
to apply the technologies covered in the module to simulated and real ARDrone
quadrotors.

For  the  hands-on  lab  assignments,  the  students  used  different  instantiations  of 
Enterprise  appropriate  for  the  lab  at  hand.  For  each  of  the  five  modules,  Enterprise 
was  used  in  the  following  configurations: 

• Module 1: Scripting - In this configuration, students were required to write
an RMPL program that compiled directly into an executable TPN.
Enterprise was configured to only use Pike as a dispatcher.

• Module 2: Localization - Enterprise was unchanged from Module 1.
Instead, the control layer was augmented to include vision-based localization
and mapping.

• Module 3: Path planning - Enterprise was configured to include basic
visibility graph and grid-based path planners. Students wrote RMPL programs
in terms of traversals between named locations instead of specifying the
coordinates of a trajectory to follow.

• Module 4: Activity planning - Enterprise was configured to include the
tBurton activity planner [11] in addition to the path planners. Students wrote
RMPL programs in terms of goals instead of specifying which actions to take.

• Module 5: Scheduling - Enterprise was configured to include the Kirk
planner [12] in addition to the path planners. Students wrote RMPL
programs in terms of which actions to perform, but were allowed to specify
choices between different action sequences that achieved the same goal.

As  this  was  an  introductory  course,  risk-aware  planners  and  schedulers  were  not 
covered. 

Sixteen students participated in this class. The majority of the students
self-reported as having little to no experience with autonomous systems, but also stated
they found Enterprise easy to use and understand.

2.2.2  WHOI 

The  Enterprise  architecture  has  also  been  deployed  to  control  a  Slocum  glider 
autonomous  underwater  vehicle  (AUV)  owned  and  operated  by  scientists  at  the 
Woods Hole Oceanographic Institution (WHOI). This capability was demonstrated
during a technology validation expedition on the R/V Falkor in the Scott Reef
lagoon in the Timor Sea from March 24 to April 6, 2015. In this demonstration,
Enterprise  was  used  as  a  decision  support  system  to  plan  for  glider  operations  in 
the  presence  of  five  other  AUVs  and  the  Falkor  itself.  The  human  operators  used 
Enterprise  to  plan  a  series  of  observations  of  target  regions  in  between  surfacing 
for  data  communication  and  plan  underwater  paths  for  the  observations. 

In the demonstration, Enterprise was used with RMPL input, the Kirk planner,
and a simplified, risk-aware path planner. Due to technical constraints imposed
by the glider control system, Pike was not used as a dispatcher. Instead, the
executable TPN was translated directly into the glider’s scripting language.

Figure 2.3: Mission goals for the Slocum glider during the Scott Reef deployment.


Figure 2.3 illustrates the mission goals for the glider deployed during the
expedition. Operators discretized a specific area of the lagoon into 15 regions of
interest, called cells, to be visited by the glider. Each cell was assigned a priority and a path (red
dashed line) for the glider to traverse. All AUVs on the deployment shared the
cells, but each had unique goals in each cell. In order to avoid collisions, a
constraint was placed on the AUVs that no more than one AUV could occupy a cell
at a time.
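The single-occupancy rule above can be checked mechanically on a candidate schedule. The following is a minimal sketch in Python, not the check used by Kirk during the expedition: the `Visit` record, the `conflicts` helper, and the vehicle names are hypothetical and serve only to illustrate the constraint.

```python
from collections import defaultdict
from itertools import combinations
from typing import List, NamedTuple, Tuple


class Visit(NamedTuple):
    """Hypothetical record: one vehicle occupying one cell over [start, end)."""
    vehicle: str
    cell: int
    start: float  # hours since mission start
    end: float


def conflicts(visits: List[Visit]) -> List[Tuple[Visit, Visit]]:
    """Return all pairs of visits by different vehicles that overlap in the same cell."""
    by_cell = defaultdict(list)
    for visit in visits:
        by_cell[visit.cell].append(visit)

    found = []
    for cell_visits in by_cell.values():
        for a, b in combinations(cell_visits, 2):
            # Half-open intervals: visits that merely touch do not overlap.
            overlap = a.start < b.end and b.start < a.end
            if a.vehicle != b.vehicle and overlap:
                found.append((a, b))
    return found


# Illustrative schedule: the glider and a second AUV overlap in cell 3.
schedule = [
    Visit("glider", 3, 0.0, 2.0),
    Visit("auv-2", 3, 1.5, 3.0),
    Visit("auv-3", 4, 0.0, 1.0),
]
violations = conflicts(schedule)
```

A schedule respects the constraint exactly when `conflicts` returns an empty list.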

At the beginning of the deployment, the path planner was used without Kirk
to plan transits for nine days in initial testing. At the time of the cruise, the path
planner used a simplified dynamics model and relied on the operator for risk
allocation. Kirk and the path planner were then used in conjunction to successfully
plan for two days of eight-hour operations for the glider. The activity planner
efficiently 1) selected the subset of science goals with the highest return based on science
preference, and 2) ordered and scheduled visitation to respect the aforementioned
constraints. Ocean currents in Scott Reef changed frequently and posed a
challenge for the AUVs deployed during the expedition. The path planning component
successfully planned safe routes around the reef. Moreover, we demonstrated the
executive's capability to support re-planning after each glider surface activity. To
the best of our knowledge, a Slocum glider had never before been used inside a
reef, due to the challenges present in that environment.


Chapter 3

Programming  risk-aware  missions 
with  cRMPL 


This project developed a chance-constrained, reactive model-based programming
language (cRMPL) that allows the desired behavior of autonomous systems to be
specified at a high level of abstraction. It extends the original RMPL [13] with the
following features: I) sensing actions, in the form of probabilistic choice nodes;
II) probabilistic temporal uncertainty; III) safety guarantees in the form of chance
constraints; and IV) state constraints.

Mission specifications in cRMPL offer flexibility in the choice of action
sequences used to reach goals, which is exploited during execution to achieve
robustness. In cRMPL, as in traditional programming languages, time-evolved behavior is
specified using standard control constructs, including parallel and sequential
execution, conditional execution, iteration, contingent execution, and timed execution. The latter
enables the specification of time-critical missions.

Unlike traditional languages, cRMPL bounds the risk of execution failure by
allowing a chance constraint to be specified over any cRMPL (sub)expression.
This constraint specifies the maximum probability with which that expression may fail to
terminate successfully. To improve robustness, cRMPL includes several constructs
for introducing choices that the executive makes at run-time, in order to
adapt to delays and failures. These include decision-theoretic choices between
functionally equivalent procedures and bounds on procedure execution time.
Programs in cRMPL can also be elevated from specifying action sequences to
specifying desired state evolutions, by introducing state constraints as program
primitives. The executive maps states to actions by continuously planning using a set
of action models.


3.1 Features of cRMPL


Our  implementation  of  cRMPL  is  in  the  form  of  an  extension  of  Python,  a  widely 
used,  general-purpose  programming  language.  For  that  reason,  cRMPL  programs 
are  represented  as  objects  of  the  RMPyL  class,  which  stands  for  “RMPL  in  Python”. 
A  list  of  selected  cRMPL  features  follows: 

•  since  cRMPL  is  a  Python  module,  anything  available  in  Python  can  be  used 
to  manipulate  cRMPL  objects,  such  as  list  comprehensions,  dictionaries, 
recursion,  file  I/O,  network  interfaces,  etc.; 

•  cRMPL  only  depends  on  the  availability  of  a  Python  interpreter  (either 
Python  2  or  3)  and  its  standard  libraries,  therefore  being  cross-platform 
(Windows,  Mac,  and  GNU/Linux); 

• cRMPL is useful for sequencing (describing a temporal program
programmatically), as well as for modeling tasks;

• cRMPL is built around the concept of Episodes (Section 3.2) and their
compositions. Therefore, it is very easy to write libraries of cRMPL subroutines
that can be composed to form more complex cRMPL programs;

• cRMPL programs can be translated into Probabilistic Temporal Plan
Networks (pTPNs) [1] and dispatched by the pKirk executive, both developed
in the context of this grant.


3.2  Episodes 

The  core  building  block  of  cRMPL  is  the  episode  (see  Figure  3.1).  An  episode 
can  be  intuitively  understood  as  a  period  of  time  during  which  an  activity  must  be 
performed,  while  respecting  a  family  of  state,  temporal,  and  chance  constraints. 
If  the  activity  can  be  directly  executed  by  the  autonomous  system  under  control, 
we  call  the  episode  primitive.  Alternatively,  if  the  activity  to  be  performed  within 
an  episode  consists  of  a  combination  of  other  episodes,  we  call  the  outer  episode 
composite.  The  ways  in  which  episodes  can  be  combined  in  cRMPL  are  explained 
in  Section  3.4. 

[Figure: episode diagram with labels Start event, End event, Activity (uav-scan), State constraint (Healthy = True), and Duration [1, 10]]

Figure 3.1: Example episode specifying that an unmanned aerial vehicle (UAV)
should scan an area for a period between 1 and 10 time units, while making sure
that it maintains itself in a healthy state. If uav-scan can be directly executed by
the UAV, this would be a primitive episode. Otherwise, if uav-scan requires a
combination of more fundamental operators (according to the operations in
Section 3.4), then this episode would be composite.

DISTRIBUTION A: Distribution approved for public release.

More formally, an episode $E$ is a tuple $\langle s, e, A, \mathbb{S}, \mathbb{T}, \mathbb{C} \rangle$, where $s$ and $e$ are,
respectively, the temporal events marking the start and end of $E$; $A$ is either a
primitive activity (one that can be readily executed by the autonomous agent) or another
episode; and $\mathbb{S}$, $\mathbb{T}$, $\mathbb{C}$ are, respectively, sets of state, temporal, and chance
constraints that should hold during the period of time $E$ is being executed. In cRMPL,
episodes are instances of the Episode class.
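The tuple definition above can be mirrored by a small data structure. The following is a minimal sketch in plain Python and is not the RMPyL implementation: the names `Event`, `EpisodeSketch`, and `is_primitive`, as well as the way constraints are stored, are assumptions chosen only to follow the symbols in the definition.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Event:
    """A temporal event, such as the start or end of an episode."""
    name: str


@dataclass
class EpisodeSketch:
    """Hypothetical stand-in for the tuple <s, e, A, S, T, C>."""
    start: Event                           # s
    end: Event                             # e
    activity: Union[str, "EpisodeSketch"]  # A: primitive activity name or nested episode
    state_constraints: List[str] = field(default_factory=list)        # S
    temporal_constraints: List[Tuple[str, str, float, float]] = field(
        default_factory=list)                                         # T
    chance_constraints: List[float] = field(default_factory=list)     # C (risk bounds)

    def is_primitive(self) -> bool:
        """An episode is primitive when its activity is directly executable."""
        return isinstance(self.activity, str)


# The episode of Figure 3.1: scan for 1 to 10 time units while staying healthy.
scan = EpisodeSketch(
    start=Event("scan-start"),
    end=Event("scan-end"),
    activity="uav-scan",
    state_constraints=["Healthy == True"],
    temporal_constraints=[("scan-start", "scan-end", 1.0, 10.0)],
)

# A composite episode whose activity is itself another episode.
mission = EpisodeSketch(Event("mission-start"), Event("mission-end"), scan)
```

Representing the activity as either a name or a nested episode keeps the primitive/composite distinction a simple type check; a fuller model would also need the composition operators of Section 3.4.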

The  next  section  provides  further  details  about  the  types  of  constraints  that  can 
be  represented  within  episodes. 


3.3  Constraints  in  cRMPL 

Constraints  in  cRMPL,  regardless  of  their  specific  type,  are  split  into  two  groups: 
model  and  user-defined.  This  characterization  is  particularly  important  in  the  con¬ 
text  of  chance  constraints  (Section  3.3.3),  since  they  can  only  be  imposed  over 
user-defined  state  and  temporal  constraints.  A  model  constraint  is  one  that  stems 
from  physical  limitations  of  the  system  at  hand,  and,  therefore,  always  holds.  One 
can  mention  as  examples  of  model  constraints  the  conservation  of  flow  in  network 
problems; the degrees of freedom of a robotic arm; and the maximum speed that
a vehicle can attain. On the other hand, user-defined constraints, as their name
suggests, are externally imposed on the system by the cRMPL programmer to cause it
to behave appropriately. For instance, speed limits on highways are user-defined
constraints that dictate the desired behavior for drivers on that road. Similarly, a
cRMPL programmer could impose the state constraint “stay away from no-fly zones”
in the episode in Figure 3.1.

With this distinction in mind, the following sections provide further details
about the types of constraints that are supported within episodes in cRMPL.


3.3.1  Temporal  constraints 

Three  basic  types  of  temporal  constraints  are  supported  in  cRMPL:  simple  tem¬ 
poral  constraints  (STCs)  [14];  STCs  with  uncertainty  (STCUs)  [15];  and  proba¬ 
bilistic  STCs(pSTCs)  [4,5]‘. 

An  STC  over  two  temporal  events  e\  and  e2  is  a  tuple  <ei,  e2,  l,  u>  imposing 
the  constraint  e2  —  e\  G  [/,  u],  l  <  u,  l,u  G  M.  For  an  STC,  both  e\  and  e2  can  be 
freely  chosen. 

An  STCU  is  very  similar  to  an  STC:  it  is  given  by  a  tuple  <ei,  e2,  /,  u>,  l  <  u, 
l,u  e  M+.  However,  for  an  STCU,  the  value  of  e2  is  uncontrollable:  it  will  be 
chosen  by  the  environment  during  execution  so  that  e2  —  e\  E  [l,  u],  but  its  specific 
value  cannot  be  chosen  beforehand. 

A pSTC extends STCUs by allowing random variables with known probability
distributions to describe the temporal distance between two events. More
formally, a pSTC is a tuple $\langle e_1, e_2, v \rangle$ such that
$e_2 = e_1 + v$, where $v$ is a random variable. As with STCUs, the specific
value of $e_2$ is not directly controllable: it is chosen by the environment
according to the value of $e_1$ and the probability distribution of $v$.
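The three constraint types differ mainly in who picks the end event. The following is a minimal Python sketch of that distinction. The class and method names (`STC`, `STCU`, `PSTC`, `end_window`, `sample_end`) are illustrative assumptions and are not part of RMPyL, and the Gaussian duration is an arbitrary choice of distribution.

```python
import random
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class STC:
    """Simple temporal constraint: e2 - e1 must lie in [lb, ub]."""
    lb: float
    ub: float

    def satisfied(self, e1: float, e2: float) -> bool:
        return self.lb <= e2 - e1 <= self.ub


@dataclass(frozen=True)
class STCU(STC):
    """Same interval, but e2 is chosen by the environment, not the executive."""

    def end_window(self, e1: float) -> Tuple[float, float]:
        # All the executive knows beforehand: e2 falls somewhere in this window.
        return (e1 + self.lb, e1 + self.ub)


@dataclass(frozen=True)
class PSTC:
    """Probabilistic STC: e2 = e1 + v, with v drawn from a known distribution."""
    draw_v: Callable[[random.Random], float]

    def sample_end(self, e1: float, rng: random.Random) -> float:
        return e1 + self.draw_v(rng)


rng = random.Random(0)
fly = STC(lb=3, ub=10)                       # executive picks any duration in [3, 10]
wind = STCU(lb=1, ub=4)                      # environment picks within [1, 4]
drive = PSTC(lambda r: r.gauss(20.0, 2.0))   # assumed Gaussian duration

print(fly.satisfied(0.0, 5.0), wind.end_window(2.0), drive.sample_end(0.0, rng))
```

An STCU deliberately exposes only a window, since no distribution is attached to it, while a pSTC can be sampled.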


3.3.2  State  constraints 

Let $X$ be a vector of state variables. The current implementation of cRMPL
supports discrete state variables over finite value domains, and numerical state
variables over continuous ranges of values. For those, two types of state
constraints are available:

•  general linear constraints of the form $A X_n \le b$, where $A$ and $b$ are
constant matrices of appropriate dimensions and $X_n$ is the subset of $X$
composed of numerical state variables;

•  and assignment constraints $X = c$, where $c$ is a vector of constants.

¹ Disjunctive and conditional temporal constraints of different types can be
represented as combinations of these different types of simple temporal
constraints with the choice operators from Section 3.4.3.

Alternatively, one can also specify a Boolean constraint-checking function
$f_S(X)$, which checks whether the vector of state variables $X$ satisfies the
state constraints $S \subseteq \mathbb{S}$.
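As a rough illustration of the two constraint forms and of a Boolean checker built from them, consider the sketch below. The function names, the box-shaped region, and the `camera` variable are hypothetical and chosen only for this example.

```python
from typing import Dict, Sequence


def linear_ok(A: Sequence[Sequence[float]], b: Sequence[float],
              x: Sequence[float], tol: float = 1e-9) -> bool:
    """Check the linear constraint A x <= b row by row."""
    return all(
        sum(a_ij * x_j for a_ij, x_j in zip(row, x)) <= b_i + tol
        for row, b_i in zip(A, b)
    )


def assignment_ok(state: Dict[str, object], required: Dict[str, object]) -> bool:
    """Check that every required discrete variable holds the required value."""
    return all(state.get(name) == value for name, value in required.items())


def f_S(x_num: Sequence[float], x_disc: Dict[str, object]) -> bool:
    """Boolean checker combining both constraint types (hypothetical episode)."""
    # Keep the vehicle inside the box 0 <= x <= 10, 0 <= y <= 5 ...
    A = [[1, 0], [-1, 0], [0, 1], [0, -1]]
    b = [10, 0, 5, 0]
    # ... while the (hypothetical) camera is switched on.
    return linear_ok(A, b, x_num) and assignment_ok(x_disc, {"camera": "on"})


print(f_S([3.0, 4.0], {"camera": "on"}))    # inside the box, camera on
print(f_S([11.0, 4.0], {"camera": "on"}))   # x exceeds its upper bound
```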

There  is  also  a  distinction  in  terms  of  the  period  of  time  during  which  state 
constraints  should  hold: 

at start: state constraints that should hold by the time the start event of an
episode is executed;

at end: same as at start, but for the end event of an episode;

during: state constraints that should hold during the whole period between the
start and end events, but not necessarily at the extrema;

overall: state constraints that should hold at all previously mentioned periods.

3.3.3  Chance  constraints 

The ability to impose chance constraints on episodes is one of the most important
features of cRMPL. In essence, a chance constraint provides a bound $\Delta$ on the
probability of a set of user-defined constraints $R$,
$R \subseteq (\mathbb{S} \cup \mathbb{T})$, being violated during the execution
of the episode. Therefore, a chance constraint is a tuple
$C = \langle R, \Delta \rangle$. Model constraints are not included in chance
constraints because they are trivially satisfied by the underlying physics.
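When uncertainty is described by a finite set of scenarios, checking a chance constraint reduces to summing the probability of the scenarios that violate at least one constraint in the set. A minimal sketch follows, with hypothetical scenario data and function names; it is not how cRMPL itself evaluates risk.

```python
from typing import Callable, Dict, List, Tuple

Scenario = Dict[str, float]
Constraint = Callable[[Scenario], bool]   # returns True when the constraint holds


def violation_probability(scenarios: List[Tuple[Scenario, float]],
                          constraints: List[Constraint]) -> float:
    """Probability mass of scenarios violating at least one constraint in R."""
    return sum(p for s, p in scenarios if not all(c(s) for c in constraints))


def chance_constraint_holds(scenarios: List[Tuple[Scenario, float]],
                            constraints: List[Constraint],
                            delta: float) -> bool:
    """True when the violation probability stays within the bound delta."""
    return violation_probability(scenarios, constraints) <= delta


# Hypothetical mission-duration scenarios (minutes) with their probabilities.
scenarios = [({"duration": 12.0}, 0.70),
             ({"duration": 17.0}, 0.25),
             ({"duration": 21.0}, 0.05)]
R = [lambda s: s["duration"] <= 18.0]     # user-defined temporal constraint

print(chance_constraint_holds(scenarios, R, delta=0.10))   # risk 0.05 within 0.10
print(chance_constraint_holds(scenarios, R, delta=0.01))   # risk 0.05 exceeds 0.01
```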


3.4  Composing  episodes  in  cRMPL 

As  in  the  original  RMPL,  cRMPL  subroutines  can  be  hierarchically  combined  into 
composite  episodes  using  sequence,  parallel,  and  choice  operators.  There  is  also 
iteration,  which  is  built  on  top  of  sequence  and  choice.  As  previously  mentioned, 
cRMPL  programs  are  represented  as  instances  of  the  RMPyL  class,  which  are 
capable  of  specifying  complex  temporal  behavior  by  combining  episodes  through 
the  operations  described  in  this  section. 

3.4.1  Sequence  composition 

In a sequence composition of episodes (Figure 3.2), one is executed immediately
after the other, with an optional controllable slack between them represented by a
$[0, \infty]$ STC. In cRMPL, a list of episodes can be composed in sequence through
the class method RMPyL.sequence. Alternatively, pairs of episodes can be composed
in sequence by an overloaded binary operator.




Figure  3.2:  Excerpt  of  a  pTPN  depicting  a  sequential  composition  of  episodes. 


3.4.2  Parallel  composition 

In a parallel composition of episodes (Figure 3.3), there is a common event where
the parallel composition starts, followed by the simultaneous execution of all
episodes in the composition, and a common end event where all parallel execution
branches come to an end. In cRMPL, a list of episodes can be composed in parallel
through the class method RMPyL.parallel. Alternatively, pairs of episodes can
be composed in parallel by the overloaded binary operator “+”.


Figure  3.3:  Excerpt  of  a  pTPN  depicting  a  parallel  composition  of  episodes. 
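To make the timing semantics of the two compositions concrete, the sketch below tracks only the shortest possible duration of a composite episode. The `Ep` class and its `then` method are hypothetical stand-ins rather than RMPyL classes; `+` is used for parallel composition as described above, and sequencing is a named method because it does not rely on any particular operator symbol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ep:
    """Hypothetical episode summary: only its shortest possible duration."""
    min_duration: float

    def then(self, other: "Ep") -> "Ep":
        # Sequence: durations add up (the optional slack can be zero).
        return Ep(self.min_duration + other.min_duration)

    def __add__(self, other: "Ep") -> "Ep":
        # Parallel: the common end event waits for the slowest branch.
        return Ep(max(self.min_duration, other.min_duration))


scan, fly, refuel = Ep(1.0), Ep(3.0), Ep(6.0)
both = scan.then(fly) + refuel   # (scan; fly) in parallel with refuel
print(both.min_duration)         # the slower branch (refuel) dominates
```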


3.4.3  Choice  composition 

A choice composition is similar to a parallel one in terms of structure, but only one
of the branches is ever executed. Therefore, it represents a disjunction. Choices
are pictorially represented as double circles, as shown in Figure 3.4. Choices
are either controllable (assigned by the control program) or uncontrollable
(assigned by an external agent, such as a sensor reading). Uncontrollable choices
can be either non-deterministic, with no associated probability distribution,
or probabilistic, with a probability associated with each possible outcome.
Choice compositions are implemented in cRMPL by the class method
RMPyL.choose. Since controllable choices are often used to represent decisions
made by the program executive, while uncontrollable choices usually represent
sensor readings and other types of environmental observations, cRMPL programs
also have the syntactic-sugar constructs RMPyL.decide and RMPyL.observe to
represent, respectively, controllable and uncontrollable choices.


Figure  3.4:  Excerpt  of  a  pTPN  depicting  a  choice  (disjunction)  composition  of 
episodes. 


3.4.4  Iteration  composition 

Iterations in cRMPL are implemented by the class method RMPyL.loop. It is
recursively defined using the previously described sequence and choice operators,
modeling the process of choosing between executing an activity one more time and
terminating the looping behavior. Iterations constructed with controllable choices
give the program executive flexibility to execute an activity one or more times
(up to a maximum number of repetitions), should that be beneficial to the mission.
On the other hand, loops with uncontrollable choices can be used to represent an
externally controlled looping behavior.

Figure 3.5: Excerpt of a pTPN depicting an iterative composition of episodes. At
the beginning of each iteration, there is a choice between executing the loop once
more, or ceasing the iteration.
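The recursive definition can be sketched with plain tuples standing in for composite episodes. This is an illustration under assumptions: the tuple encoding, the `loop` and `run_counts` helpers, and the decision to place the stop-or-continue choice before every iteration (as in the caption of Figure 3.5) are not taken from the RMPyL implementation.

```python
def loop(body, max_reps):
    """Unroll a bounded loop as nested choice(stop, sequence(body, loop(...)))."""
    if max_reps == 0:
        return ("stop",)
    return ("choice",
            ("stop",),
            ("sequence", body, loop(body, max_reps - 1)))


def run_counts(node):
    """Enumerate how many times the body may execute in the given tree."""
    kind = node[0]
    if kind == "stop":
        return {0}
    if kind == "episode":
        return {1}
    if kind == "sequence":
        return {a + b for a in run_counts(node[1]) for b in run_counts(node[2])}
    if kind == "choice":
        return set().union(*(run_counts(child) for child in node[1:]))
    raise ValueError(f"unknown node type: {kind}")


scan = ("episode", "scan")
print(sorted(run_counts(loop(scan, 3))))   # every count from 0 up to the maximum
```

Whether each choice node is a decision or an observation determines if the executive or the environment controls the number of repetitions.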

3.5  Simple  UAV  scenario  in  cRMPL 

Figure 3.6 presents a “Hello World” example showing how cRMPL can be used
for modeling, as well as for programming desired temporal behavior.

    from rmpyl.rmpyl import RMPyL, Episode, ChanceConstraint

    class UAV:  # UAV class that generates episodes
        def fly(self):
            return Episode(duration={'ctype': 'controllable',
                                     'lb': 3, 'ub': 10},
                           action='(fly)')

        def scan(self):
            return Episode(duration={'ctype': 'controllable',
                                     'lb': 1, 'ub': 10},
                           action='(scan)')

    hello = UAV(); uav = UAV()

    pr = RMPyL()
    pr.plan = pr.sequence(
        pr.parallel(
            pr.sequence(
                hello.scan(),
                hello.fly()),
            pr.sequence(
                uav.fly(),
                uav.scan())),
        pr.decide({'name': 'UAV-choice',
                   'domain': ['Hello', 'UAV'],
                   'utility': [5, 7]},
                  hello.fly(),
                  uav.fly()))

    tc = pr.add_overall_temporal_constraint(ctype='controllable',
                                            lb=0.0, ub=18.0)
    cc_time = ChanceConstraint(constraint_scope=[tc], risk=0.1)
    pr.add_chance_constraint(cc_time)


(a)  “Hello  World”  example  in  cRMPL. 


(b)  Corresponding  pTPN. 


Figure 3.6: “Hello World” example in cRMPL and its corresponding representation
as a pTPN.

Chapter 4

Risk-bounded consistency of Probabilistic Temporal Plan Networks


Autonomous agents are often not adopted in highly uncertain environments due
to the risk of mission failure and loss of vehicles. Prior work on contingent plan
execution addresses this issue by placing bounds on uncertain variables and by
providing consistency guarantees for a ‘worst-case’ analysis, which tends to be
too conservative for real-world applications. This chapter presents work that
unifies features from risk-sensitive trajectory optimization and high-level plan
execution in order to extend existing guarantees of consistency for conditional
plans to a chance-constrained setting. The result is a set of efficient algorithms
for computing plan execution policies with explicit bounds on the risk of failure.
To accomplish this, we introduce Probabilistic Temporal Plan Networks (pTPNs),
which improve upon previous formulations by incorporating probabilistic
uncertainty and chance-constraints into the plan representation. We develop a
novel method for solving the chance-constrained strong consistency problem,
leveraging a conflict-directed approach that searches for an execution policy
that maximizes reward while meeting the risk constraint. Experimental results
indicate that our approach for computing strongly consistent policies has an
average scalability gain of about one order of magnitude when compared to
current methods based on chronological search.


4.1  Representing uncertainty in contingent temporal plans

Real-world environments are inherently uncertain, causing agents to inevitably
experience some level of risk of failure when trying to achieve their goals. Instead
of neglecting the existence of risk or overlooking the fact that unexpected things
might have significant impacts on a mission, it becomes key for autonomous
systems trusted with critical missions to have a keen sensitivity to risk and to be
able to incorporate uncertainty into their decision-making.

In this chapter, we address the problem of extracting execution policies with
risk guarantees from contingent plans with uncertainty. The current practice for
ensuring safety in these missions requires groups of engineers to reason over a
very large number of potential decisions and scenarios that might unfold during
execution, which is a challenging, time-consuming, and error-prone process.
Given a description of a contingent plan, several approaches in the literature
developed notions of consistency by representing uncertainty as set-bounded
quantities, i.e., as intervals of values with no associated probability
distribution [16-19]. Nevertheless, in order to guarantee feasibility in all
possible scenarios, consistency-checking algorithms based on set-bounded
uncertainty end up performing a worst-case analysis. When considering situations
where uncertainty causes small plan deviations around otherwise “nominal” values,
these set-bounded consistency criteria work well and output robust, albeit
conservative, execution policies. However, they have difficulty handling problem
instances where uncertainty can potentially lead the system to very hard or even
irrecoverable scenarios, often reporting that no robust execution policy exists.
This is most certainly undesirable, since reasonable amounts of risk can usually
be tolerated for the sake of not having the autonomous agent sit idly due to its
absolute “fear” of the worst.

The work presented in this chapter improves upon the state of the art in
conditional plan execution by extending the notions of weak and strong plan
consistency to a risk-bounded setting and providing efficient algorithms for
determining (or refuting) them. These risk bounds are also known as
chance-constraints [20]. Weak and strong consistency are useful concepts when
planning missions for agents whose embedded hardware has very limited
computation and telecommunication power, making it hard for them to come up with
solutions ‘on the fly’ or for remote operators to intervene in a timely fashion.
Chance-constrained weak consistency (CCWC) is a useful concept for missions
where agents operate in static or slowly changing environments after an initial
scouting mission aimed at reducing plan uncertainty. Chance-constrained strong
consistency (CCSC), on the other hand, removes the need for a scouting mission
and tries to determine the existence of a solution that, with probability greater
than some threshold, will succeed irrespective of the outcomes of uncertainty in
the plan. Strong consistency is clearly more conservative, but it is appealing to
mission managers because strongly consistent policies require little to no
onboard sensing and decision making, greatly reducing the agents’ complexity and
costs. They also reduce or completely eliminate the need to coordinate between
multiple agents. Finally, the robustness of a strongly consistent policy makes it
easier to check by human operators before it is approved for upload to the
remote agent.

We introduce Probabilistic Temporal Plan Networks (pTPNs) as our representation
of contingent plans. Our pTPN representation shares many similarities with
Temporal Plan Networks with Uncertainty (TPNU) [2,17], Conditional
Temporal Plans (CTPs) [16], Disjunctive Temporal Problems with Uncertainty
(DTPU) [18], and the Conditional Simple Temporal Network with Uncertainty
(CSTNU) [19], but extends them in two important ways. First, pTPNs allow
uncontrollable choices (discrete, finite-domain random variables) to have their
joint distributions described by a probability model, as opposed to a purely
set-bounded uncertainty representation in DTPUs and CTPs. Second, pTPNs allow
the user to specify admissible risk thresholds that upper bound the probability
of violating sets of constraints in the plan, a missing feature in CTPs, DTPUs,
and TPNUs. The latter extension is an important improvement of pTPNs over
previous representations when modeling many real-world problems, where risk-free
plans that are robust to all possible uncontrollable scenarios are often
unachievable. Our pTPNs are compiled from contingent plan descriptions in cRMPL
(Chapter 3).


4.2  A  conflict-directed  approach  to  risk-bounded  plan 
consistency 

Our algorithms reason quantitatively about the probability of different random
scenarios and explore the space of feasible solutions efficiently while bounding
the risk of failure of an execution policy below a user-given admissible
threshold. While state-of-the-art methods in the conditional and stochastic CSP
literature rely on a combination of chronological (depth-first) search and
inference in the space of contingencies in order to quickly find satisficing
solutions [21-25], in this work we introduce a “diagnostic” approach based on
Conflict-Directed A* (CDA*) [26]. By continuously learning subsets of conflicting
constraints and generalizing them to a potentially much larger set of
pathological scenarios, our algorithms can effectively explore the space of
robust policies in best-first order of risk while ensuring that the risk stays
within the user-specified bound. For the problem of extracting a strongly
consistent policy from a contingent plan description, our numerical results
showed significant gains in scalability for our approach.

Here we motivate the usefulness of chance-constrained consistency on a very
simple commute problem, whose pTPN representation is given in Figure 4.1. We
start at home and our goal is to be at work for a meeting in at most 30 minutes.
Circles represent the start and end events of temporally-constrained activities
called episodes. Simple temporal constraints [14] are represented by arcs
connecting temporal events and represent linear bounds on their temporal
distance. For simplicity, we assume that we are given only three possible choices
in this pTPN: we can either ride a bike to work, drive our car, or stay home and
call our employer saying that we will not be able to make it to work today. The
rewards (R values) associated with each one of these choices in Figure 4.1
correspond to how much money we would make that day minus the cost of
transportation. Uncontrollable choices are depicted in Figure 4.1 by double
circles with dashed lines. These are random, discrete events that affect our plan
and whose probability model is also given in Figure 4.1.


Figure  4.1:  A  pTPN  for  a  simple  plan  to  get  to  work 


In this example, the uncontrollable choices model what might “go wrong”
during plan execution and the impact of these unexpected events on the overall
duration of the plan. For example, if we decide to ride a bike to work (the option
with the highest reward), there is the possibility that we might slip and fall.
This event has a minor effect on the duration of the ride, but would force us to
change clothes at our workplace because we cannot spend the day in a dirty suit.
Since we only have 30 minutes before the meeting starts, the uncontrollable event
of slipping would cause the overall plan to be infeasible. A similar situation
happens if we choose to drive our car and happen to be involved in an accident.

By ignoring probabilities and using a consistency checker for the pTPN in
Figure 4.1 based on a set-bounded representation of uncertainty, we would realize
that the pTPN is guaranteed to be consistent. Unfortunately, unless we had a
way of telling ahead of time whether we would slip from the bike or be in a car
accident, the suggested policy would be to always stay home! This is because,
for the choice of riding a bike or driving our car to work, there are
uncontrollable scenarios that cause the plan to fail, causing the set-bounded
consistency checker to fall back to the safe, albeit undesirable, choice of
staying at home. This clearly disagrees with our common sense, since people try
to achieve their goals while acknowledging that uncontrollable events might cause
them to fail. Next, we show how our chance-constrained approach would produce
execution policies that agree with what we would expect a “reasonable” agent to
do.

Let’s consider the case where we accept that our plan might fail, as long as
the risk $\Delta$ is no more than 2%. Given that riding a bike is the option with
the highest reward, our algorithm would deem bike riding the most promising and
would start by checking if choosing to ride a bike meets the chance-constraint
$\Delta \le 2\%$. If there existed a feasible activity schedule satisfying the
temporal constraints for both values of Slip, we could pick this schedule and our
risk of failure would be zero, which is clearly less than our risk bound. However,
our algorithm concludes that the scenario Slip = True is inconsistent with the
overall temporal constraint of arriving at the meeting in less than 30 minutes, so
there must exist a nonzero risk of failure in this case. According to the model in
Figure 4.1, the probability of slipping is Pr(Slip) = 5.1%, so riding a bike does
not meet the chance-constraint $\Delta \le 2\%$. The next best option is driving
our car, where now we are subject to the uncontrollable event of being in a car
accident. Following a similar analysis, we conclude that the risk of our plan
being infeasible in this case is Pr(Accident) = 1.3%, which meets the
chance-constraint. Therefore, our algorithm would advise us to drive to work
within the temporal bounds shown in Figure 4.1 for the case where no accident
happens. We claim that this chance-constrained type of reasoning approximates the
decision-making process of mission managers much better than its set-bounded
alternative, since operators need to commit to plans with acceptable levels of
risk in order to extract some useful output from the remote explorer.

It is worth noticing that choosing $\Delta < 1.3\%$ would have made staying at
home the only feasible alternative. Hence, as in the set-bounded approach, a
chance constraint may still be too conservative to allow for a feasible solution.
Moreover, if the overall temporal constraint of 30 minutes in Figure 4.1 were
relaxed to 35 minutes, our algorithm would have been capable of finding a
risk-free scheduling policy for its first choice of riding a bike. This is
because, in this case, there would exist a feasible schedule satisfying all
temporal constraints on the upper side of the pTPN, i.e., temporal constraints
activated by both Slip = True and Slip = False.
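The reasoning in the commute example can be replayed in a few lines of Python. The sketch below is a simplified illustration, not the conflict-directed algorithm itself: it visits the controllable choices in decreasing order of reward and returns the first one whose risk fits the bound. The risk values come from the example; the reward numbers are placeholders that only preserve the ordering bike > car > stay home, since the actual R values appear in Figure 4.1.

```python
from typing import Dict, List, Optional


def best_admissible(options: List[Dict], delta: float) -> Optional[str]:
    """Return the highest-reward option whose risk does not exceed delta."""
    for option in sorted(options, key=lambda o: o["reward"], reverse=True):
        if option["risk"] <= delta:
            return option["name"]
    return None


commute = [
    {"name": "bike", "reward": 3.0, "risk": 0.051},       # Pr(Slip) = 5.1%
    {"name": "car", "reward": 2.0, "risk": 0.013},        # Pr(Accident) = 1.3%
    {"name": "stay home", "reward": 1.0, "risk": 0.0},    # never misses a deadline
]

print(best_admissible(commute, delta=0.02))   # bike is too risky, car is admissible
print(best_admissible(commute, delta=0.01))   # only staying home is admissible

# Relaxing the deadline to 35 minutes removes the risk of the bike option.
relaxed = [dict(o, risk=0.0) if o["name"] == "bike" else o for o in commute]
print(best_admissible(relaxed, delta=0.02))
```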

This example highlights a few key elements of our approach. First, we divide
the problem into generating a candidate “plan” as an assignment to the
controllable choices and testing the plan against the chance constraints. Second,
candidates are enumerated in best-first order based on reward. Third, testing
feasibility of a chance-constraint is a fundamental task, in which estimating risk
is costly. We frame risk estimation as a process of enumerating the most likely
sets of scenarios that incur and do not incur risk (called conflicts and kernels,
respectively). We observe that this can be formulated as a symptom-directed,
“diagnostic” process, allowing us to leverage an efficient, conflict-directed
best-first enumeration algorithm, Conflict-Directed A* (CDA*) [26], to generate
kernels and conflicts, and hence feasible and infeasible scenarios. A detailed
description of this work can be found in [1].

Chapter 5

Model-based generation of risk-aware plans

Similar to conventional programming languages, using cRMPL (Chapter 3) to
specify safe autonomous behavior at a high level of abstraction is done
declaratively, i.e., the programmer is responsible for the explicit enumeration
of all sequence and parallel compositions of episodes, along with all decisions
(controllable choices) and observations (uncontrollable choices) in the program.
Once it is available, this program can be checked for risk-bounded temporal
consistency using the methods described in Chapter 4. There are, however, two
important caveats related to this approach. First, the computational cost of
checking consistency of cRMPL programs and the size of their corresponding pTPNs
grow exponentially with the number of choices. Second, in application domains
where there is significant flexibility in terms of the decisions an autonomous
agent can make, and the number of possible observations that it might receive
from the environment, thinking through the sequential act-observe-act process
declaratively can be too challenging a task for a human programmer.

This chapter introduces Risk-bounded AO* (RAO*), a heuristic forward search
algorithm capable of automatically generating cRMPL programs from
chance-constrained POMDP (CC-POMDP) models, a novel variant of Partially
Observable Markov Decision Processes (POMDPs) that we propose to allow autonomous
agents operating under uncertainty to optimize expected performance while
bounding the risk of violating safety constraints. In the development of RAO*
(detailed technical description available in Appendix A), we perform a systematic
derivation of execution risk in POMDP domains, improving upon how risk was
previously handled in the constrained POMDP (C-POMDP) literature. In addition
to the utility heuristic used to guide RAO* towards cRMPL programs with better
performance, our algorithm leverages an admissible execution risk heuristic
to quickly detect overly risky cRMPL program branches, therefore enabling their
early pruning. In Chapter 7, we show how cRMPL programs generated with RAO* can
be hierarchically combined within user-specified cRMPL programs in a hybrid
declarative/generative approach for describing safe autonomous behavior.


5.1  The  need  for  handling  risk  in  planning  domains 
with  uncertainty 

Partially Observable Markov Decision Processes (POMDPs) [27] have become
one of the most popular frameworks for optimal planning under actuator and
sensor uncertainty, where POMDP solvers find policies that maximize some measure
of expected utility [28,29].

In many application domains, however, performance is not enough. Critical
missions in real-world scenarios require agents to develop a keen sensitivity to
risk, which needs to be traded off against utility. For instance, a search and
rescue UAV should maximize the value of the information gathered, subject to
safety constraints such as avoiding dangerous areas and keeping sufficient battery
levels. In these domains, autonomous agents should seek to optimize expected
reward while remaining safe by deliberately keeping the probability of violating
one or more constraints within acceptable levels. Unsurprisingly, attempting to
model risk bounds as negative rewards leads to models that are over-sensitive to
the particular penalty value chosen, and to policies that are overly risk-averse
or overly risk-taking [30]. Therefore, to accommodate the aforementioned types of
scenarios, new models and algorithms for constrained MDPs have started to emerge,
which handle chance constraints explicitly.

Research has mostly focused on fully observable constrained MDPs, for which
non-trivial theoretical properties are known [31,32]. Existing algorithms cover an
interesting spectrum of chance constraints over secondary objectives or even
execution paths, e.g., [33-35]. For constrained POMDPs (C-POMDPs), the state of
the art is less mature. It includes a few suboptimal or approximate methods based
on extensions of dynamic programming [36], point-based value iteration [37],
approximate linear programming [38], or on-line search [30]. Moreover, the
modeling of chance constraints through unit costs in the C-POMDP literature has a
number of shortcomings, such as requiring constraint violations to be fully
observable and to cause program execution to halt immediately, leading to
conservatism.


5.2  Searching for optimal, risk-bounded cRMPL programs

Because cRMPL supports programs that control agents with hidden state (only
partially observable through sensing actions), RAO* performs its search for a
performance-maximizing, risk-bounded cRMPL program in the space of discrete
belief states. Simply put, a discrete belief state consists of a list of possible
hidden states that an autonomous agent (and its environment) might be in at a
given point in time, each with its associated probability (see top right corner
of Figure 5.1).
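A discrete belief state and its evolution through one act-and-observe step can be sketched as a standard Bayes filter. The two-state model below (states `safe` and `danger`, action `move`, observations `alarm` and `quiet`) and all of its probabilities are hypothetical, and the helper names are assumptions made for this example.

```python
from typing import Callable, Dict

Belief = Dict[str, float]   # hidden state -> probability


def predict(belief: Belief, action: str,
            transition: Callable[[str, str], Belief]) -> Belief:
    """Prior belief: push the belief through the (hidden) state transition."""
    prior: Belief = {}
    for s, p in belief.items():
        for s_next, p_t in transition(s, action).items():
            prior[s_next] = prior.get(s_next, 0.0) + p * p_t
    return prior


def update(prior: Belief, observation: str,
           obs_model: Callable[[str, str], float]) -> Belief:
    """Posterior belief: reweight by the observation likelihood and normalize."""
    unnormalized = {s: obs_model(s, observation) * p for s, p in prior.items()}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s: p / total for s, p in unnormalized.items()}


def transition(state: str, action: str) -> Belief:
    if state == "safe" and action == "move":
        return {"safe": 0.9, "danger": 0.1}
    return {state: 1.0}


def obs_model(state: str, observation: str) -> float:
    p_alarm = 0.8 if state == "danger" else 0.1
    return p_alarm if observation == "alarm" else 1.0 - p_alarm


b0 = {"safe": 1.0}
prior = predict(b0, "move", transition)          # belief after acting
posterior = update(prior, "quiet", obs_model)    # belief after observing
print(prior, posterior)
```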

Starting from an initial belief state $b_0$, RAO* explores the space of belief
states using heuristic forward search by keeping track of two graphs: I) an
explicit search graph G, whose nodes represent all belief states explored so far;
and II) the greedy graph g, which is the subset of the explicit graph that
currently contains our best estimate of the best-performing cRMPL program.
Figure 5.1 shows the tree representation of an “act-and-observe” step in a cRMPL
program, and Figure 5.2 shows how the information in Figure 5.1 can be represented
in the form of a hypergraph. As shown in Figure 5.2, controllable choices in cRMPL
are represented as actions in the hypergraph. Once the agent performs an action,
it receives one of a family of possible sensor inputs (including “no input”).
These correspond to uncontrollable choices in cRMPL, and are represented as
hyperedges (associated with that particular action) in the hypergraph where RAO*
performs its search. Figure 5.2 shows a portion of the explicit search graph G
with a leaf node on the A-th layer being expanded.

Similar to AO* [39], RAO* searches for performance-maximizing cRMPL
programs by propagating optimistic (upper bound) estimates of utility recursively
from child to parent nodes in its explicit graph G. In addition to utility, RAO*
introduces a novel heuristic propagation of the execution risk associated with a
cRMPL program, which allows it to compute increasingly precise estimates
of how likely it is for program execution to deviate from safety and violate one
or more user-specified constraints. As with utility, optimistic estimates of a
cRMPL program's execution risk (lower bounds, in this case) are propagated
recursively from child to parent nodes in the explicit graph G (see equation
(12) in Appendix A). These optimistic risk estimates are used to quickly detect

Figure 5.1: Detailed representation of the process of choosing among $n$ different
controllable choices in a cRMPL program ($a_1, a_2, \ldots, a_n$), and subsequently
receiving one of $m$ possible uncontrollable choices ($o_1, o_2, \ldots, o_m$). Nodes
(circles) represent belief states, which are in turn represented as lists of possible
hidden states and associated probabilities. Belief states after performing an action
are called prior beliefs, while belief states that incorporate new observations are
called posterior beliefs.


overly risk-taking branches in a cRMPL program during the search process, allowing
for their early pruning and a potentially significant reduction of the search space.
At any given point during the search, the greedy graph g of RAO* represents the
current estimate of a contingent cRMPL program that maximizes agent performance
while strictly abiding by the risk bounds defined by chance constraints.
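
The child-to-parent propagation of execution risk can be illustrated with a short Python sketch. It assumes a recursion of the form er(b) = r_b(b) + (1 - r_b(b)) * sum_o Pr(o) * er(b_o), where r_b(b) is the immediate risk at belief b; the exact expression used by RAO* is equation (12) in Appendix A and is not reproduced here, so this form, the node layout, and the function names are assumptions of the example. Leaves contribute only their immediate risk, which corresponds to an optimistic estimate of zero future risk.

```python
def execution_risk(node):
    """Optimistic (lower bound) execution risk, propagated from children to parent.

    A node is a dict with an immediate risk "risk" and, optionally, a list
    "children" of (observation probability, child node) pairs for the action
    currently selected at that node.
    """
    r_b = node["risk"]
    future = sum(p * execution_risk(child) for p, child in node.get("children", []))
    return r_b + (1.0 - r_b) * future


def violates_bound(node, risk_bound):
    """Simplified pruning test: even the optimistic estimate exceeds the bound."""
    return execution_risk(node) > risk_bound


# Hypothetical two-branch example.
leaf_a = {"risk": 0.10}
leaf_b = {"risk": 0.02}
root = {"risk": 0.0, "children": [(0.5, leaf_a), (0.5, leaf_b)]}

risk_at_root = execution_risk(root)
```

Because the estimate is a lower bound, a branch flagged by `violates_bound` can be discarded safely: no refinement of the estimate can bring its risk back under the bound.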

Appendix  A  presents  a  detailed  derivation  of  the  RAO*  algorithm,  along  with 
its  pseudo-code  and  thorough  numerical  evaluation. 
Figure 5.2: Representation of Figure 5.1 as a hypergraph node with several
hyperedges associated with different actions. In a cRMPL program, actions correspond
to controllable choices (decisions), while hyperedges are associated with the
possible observations (uncontrollable choices) that the agent might receive from the
environment upon executing an action.
Chapter 6

Chance-constrained  optimal 
scheduling 


In field deployment, there is often uncertainty about the exact timing of events
caused by actions whose durations are neither controllable by the operator nor
known a priori. For example, while the nominal flight time of a vehicle may be
known, the actual flight time varies with the conditions on the day. A reliable,
robust scheduling scheme is thus required to take into account the uncertainty in
the durations in the mission.

The  algorithm  known  as  Picard  deals  with  the  risk-aware  scheduling  aspect 
of  the  overall  problem.  In  contrast  to  traditional  approaches,  we  use  probabilistic 
representations  of  uncertainty,  representing  the  values  of  uncontrollable  durations 
with  random  variables.  The  distributions  allow  us  to  reason  about  the  most  likely 
ranges  for  the  durations,  and  provide  schedules  which  will  meet  the  timing  re¬ 
quirements  with  probabilistic  guarantees  without  undue  conservatism. 


6.1  Problem  Statement 

In  field  deployment  on  critical  missions,  the  cost  of  failing  to  meet  timing  con¬ 
straints  is  often  difficult  to  quantify.  We  must  instead  provide  probabilistic  guar¬ 
antees  for  timeliness,  accounting  for  the  uncertain  durations.  In  addition  to  robust¬ 
ness  against  constraint  violation,  the  desirability  of  schedules  may  also  depend  on 
the  time  assignments:  the  quality  of  shallow  water  data  may  depend  on  the  collec¬ 
tion  time  due  to  the  tidal  cycle,  and  car  sharing  networks  may  require  the  inactive 
times  of  the  cars  to  be  low  to  maximize  use  of  assets. 



DISTRIBUTION  A:  Distribution  approved  for  public  release. 


Descriptions and corresponding solution methods for such problems must thus
have the following characteristics. First, the description must allow the
specification of a utility function to be optimized. Second, the description must
recognize stochasticity in durations with a probabilistic representation. Further,
the problem description must allow rich expressions of constraints. For example, we
must be able to describe requirements between the timing of two uncertain events
when we schedule a traversal with uncertain duration to observe natural phenomena
with uncertain timing. The scheduler must thus maximize utility while providing
probabilistic guarantees of compliance with requirements.

We  thus  make  use  of  the  formalism  described  in  [7]  to  define  the  underlying 
network  describing  the  temporal  constraints. 

Definition 1. (Probabilistic STN) Let:

• activated time-points $b_i \in \mathbb{R}$ be those assigned by the agent;

• received time-points $e_i \in \mathbb{R}$ be those assigned by the external world;

• free constraints $c_{xy}$ (Free) be constraints of type $(y - x) \in [l_{xy}, u_{xy}]$,
where $x, y$ are time points; and

• uncertain durations (uDn) $d_{xy} : \Omega \to \mathbb{R}$ be random variables describing
the difference $(y - x) = d_{xy}(\omega)$, where $y$ is a received time point and $x$
is an activated time point, for $(\Omega, \mathcal{F}, P)$ a probability space with sample
space $\Omega$, $\sigma$-algebra $\mathcal{F}$ and measure $P$.

Then, $\mathcal{N}^{+} = (X_b, X_e, R_c, R_d)$ defines a pSTN, with

• $X_b = \{b_1, \ldots, b_B\}$ the set of $B \in \mathbb{N}$ activated time-points;

• $X_e = \{e_1, \ldots, e_E\}$ the set of $E \in \mathbb{N}$ received time-points;

• $R_c = \{c_{i_1 j_1}, \ldots, c_{i_C j_C}\}$ the set of $C \in \mathbb{N}$ Frees; and

• $R_d = \{d_{i_1 j_1}, \ldots, d_{i_G j_G}\}$ the set of $G \in \mathbb{N}$ uDns.

The  pSTN  allows  us  to  express  the  controllable  and  uncontrollable  events  in 
a  mission,  as  well  as  the  requirements  between  them  and  the  durations  which  are 
responsible  for  the  uncertainty  in  the  timing  of  events.  We  must  find  a  schedule 
which  optimizes  an  objective  function,  but  also  has  a  limited  probability  of  failing 
to  meet  the  constraints  encoded  in  the  pSTN.  More  formally,  we  must  solve  a 
chance-constrained  probabilistic  simple  temporal  problem  (cc-pSTP),  as  defined 
below. 
Definition 2. (Chance-constrained pSTP)

Given:

• $\mathcal{N}^{+} = (X_b, X_e, R_c, R_d)$, a pSTN;

• $\Delta_t \in [0, 1]$, an upper bound on the risk of failure, for the set $R_c$ of Frees;
and

• $V : \mathbb{R}^B \to \mathbb{R}$, an objective function dependent on assignments to $X_b$;

Find:

• $S_B^* \in \mathbb{R}^B$, a schedule of $X_b$ minimizing $V$;

Subject to:

• $r_{R_c}(S_B^*) \le \Delta_t$, the probability of inconsistency bounded by $\Delta_t$.


In solving a cc-pSTP, we are required to find a timetable for the events
whose timings we can control. The timetable must be optimal with respect to a
predefined objective function, while meeting the temporal constraints set out in
the pSTN with a probability of at least $1 - \Delta_t$.
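
To make the chance constraint concrete, the following Python sketch estimates the probability of inconsistency of a fixed schedule by Monte Carlo sampling. It is not how Picard evaluates risk; the function name `estimate_risk`, the data layout, and the single-duration example are assumptions chosen for illustration. Each sample draws the uncertain durations, derives the received time-points from the activated ones, and checks every Free.

```python
import random


def estimate_risk(schedule, durations, frees, n_samples=20000, seed=0):
    """Monte Carlo estimate of the probability that some Free is violated.

    schedule  : {activated time-point: assigned time}
    durations : {(x, y): sampler(rng)}, with x activated and y received
    frees     : list of (x, y, lower, upper) meaning lower <= y - x <= upper
    """
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_samples):
        times = dict(schedule)
        for (x, y), sampler in durations.items():
            times[y] = times[x] + sampler(rng)  # received point set by the world
        if any(not (lb <= times[y] - times[x] <= ub) for x, y, lb, ub in frees):
            failures += 1
    return failures / n_samples


# Hypothetical example: one traversal lasting about 10 +/- 1 time units that
# must finish within 12 time units of its start.
schedule = {"b1": 0.0}
durations = {("b1", "e1"): lambda rng: rng.gauss(10.0, 1.0)}
frees = [("b1", "e1", 0.0, 12.0)]

risk = estimate_risk(schedule, durations, frees)
meets_chance_constraint = risk <= 0.05  # Delta_t = 0.05 in this example
```

A schedule is acceptable under the chance constraint when the estimated risk stays below the chosen bound; sampling gives only an estimate, whereas the method in [7] provides guarantees.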


6.2  Intuition  for  solution  method 

In this section we provide an intuition for the solution method. For the full
solution method and the theoretical proofs and guarantees, the interested reader is
directed to [7].

In  general,  it  is  not  possible  to  guarantee  against  all  eventualities.  For  example, 
it  may  be  possible  that  a  vehicle  will  take  an  infinite  amount  of  time  to  arrive  at 
a  location,  having  broken  down  along  the  way.  The  key  idea  is  to  find  a  high 
probability  subset  of  scenarios,  and  provide  a  schedule  which  will  meet  all  timing 
constraints  for  the  subset  of  scenarios. 

We limit the scenarios we consider through risk allocation. We are given an
upper bound on the probability of failure to meet the constraints, which we can
regard as the total risk allowed. We can spend this risk on the tail ends of the
uncertain durations to limit the range of outcomes. By allocating risk to the lower
tail of an uncertain duration, we can find the outcome at which the cumulative
distribution function equals the allocated risk. We can then disregard any shorter
outcomes for that uncertain duration, since they can be considered unexpectedly
short. A similar procedure is applied to the upper tail.

Figure 6.1: Risk allocation to find reasonable ranges of outcomes for uncertain
durations.

In this way, we can define the unexpectedly long and short outcomes, and
disregard them in our calculations. We are allowed to do this because the combined
probability of any uncertain duration being in the unexpected regions is less than
that allowed by the operator. We have thus reduced our probabilistic temporal
uncertainty to set-bounded uncertainty. We are then able to call upon the existing
literature dealing with set-bounded temporal uncertainty, for example [40]. This
allows us to encode the cc-pSTP as a nonlinear optimization problem.
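
The risk allocation idea can be sketched in Python as follows. The sketch assumes Gaussian durations and splits the total risk uniformly across all tails, which is a simplification made for illustration; the method in [7] treats the allocation as part of the optimization. The function name `allocate_bounds` and the example durations are hypothetical. By the union bound, the probability that any duration falls outside its interval is at most the total risk.

```python
from statistics import NormalDist


def allocate_bounds(durations, total_risk):
    """Turn Gaussian durations into intervals by spending risk on their tails.

    durations  : {name: (mean, standard deviation)}
    total_risk : total probability mass that may be discarded
    Returns {name: (lower, upper)} with total_risk split evenly over all tails.
    """
    per_tail = total_risk / (2 * len(durations))
    bounds = {}
    for name, (mean, std) in durations.items():
        dist = NormalDist(mean, std)
        lower = dist.inv_cdf(per_tail)        # CDF equals the allocated risk here
        upper = dist.inv_cdf(1.0 - per_tail)  # same mass cut from the upper tail
        bounds[name] = (lower, upper)
    return bounds


# Hypothetical durations (mean, standard deviation) and a 1% total risk.
durations = {"traverse": (100.0, 10.0), "drill": (10.0, 0.5)}
bounds = allocate_bounds(durations, total_risk=0.01)
```

The resulting intervals play the role of the set-bounded uncertainty mentioned above: a schedule that is consistent for every outcome inside them fails with probability no greater than the allocated risk.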
Chapter 7

Experimental  validation 


This chapter presents experimental validation, both in simulation and on real
hardware, of the integrated components of Enterprise explained in this report.
Detailed numerical analysis of the separate pieces can be found in the referenced
publications, or in the appendices accompanying this report. Therefore, in this
chapter, we focus on demonstrating the integrated deployments of our system, and
evaluate the general applicability of our risk-aware architecture by demonstrating
our algorithms in a number of different domains, ranging from planetary rovers and
autonomous aircraft to robotic manufacturing.

The experiments in this chapter are organized as follows. In Section 7.1, we
start  by  showing  how  cRMPL  can  be  used  for  modeling  and  control  purposes  in 
a  planetary  exploration  application  involving  the  coordination  of  two  rovers  and 
a  satellite.  Next,  Section  7.2  shows  how  cRMPL  can  be  used  to  easily  describe 
contingent  plans  in  a  robotic  manufacturing  setting,  where  a  robot  is  actively  try¬ 
ing  to  adapt  to  its  human  co-worker  in  order  to  achieve  their  common  task  goals. 
Finally,  Section  7.3  shows  how  optimal  cRMPL  programs  with  sensing  actions 
can  be  generated  by  RAO*  and  used  to  implement  optimal,  safe  behavior  of  an 
autonomous  aircraft  tasked  with  finding  hidden  targets  of  interest. 


7.1  Vehicle  coordination  under  temporal  uncertainty 

Figure 7.1 depicts the planetary exploration scenario involved in this
demonstration. Two autonomous rovers, Spirit and Opportunity, are tasked with the
exploration of a number of predetermined locations on a region of Mars filled with
obstacles. Once a rover arrives at a site, it must perform a number of
science-gathering activities whose temporal durations are uncertain, but bounded by
upper and lower bounds hard-coded in the rover's science module. There is also
uncertainty related to the traversal times between locations, and these are
represented as random variables whose distributions are predicted based on the
distance between the sites and the features of the terrain (Section 7.1.1 explains
how these uncertainty models can be learned from data). Finally, both rovers have to
go back to a relay site in order to transmit their findings to an orbiting satellite,
which will be reachable by the rovers' antennas within a known time window. In order
to maximize throughput, there is the added constraint that the rovers should
communicate with the satellite at approximately the same time, but their
transmissions should not overlap. Finally, since one would like to avoid any need
for coordination between the two rovers, we require that there exists a precomputed
schedule that satisfies all temporal constraints and is robust to the uncertainty in
the temporal durations coming from the rover model.

Figure  7.2  shows  the  portion  of  cRMPL  code  describing  the  model  for  the 
different  actions  available  to  the  rovers,  along  with  their  temporal  durations  and 
hierarchical  composition.  The  PySuluRMPyL  object  implements  pSulu  [41,42], 
a  chance-constrained  path  planner  developed  in  our  group  and  called  from  within 
cRMPL  to  return  trajectories  that  are  safe  when  avoiding  obstacles.  These  tra¬ 
jectories  are  converted  into  traversal  episodes  using  cRMPL’s  composition  con¬ 
structs. 

The cRMPL code in Figure 7.3 describes the temporal coordination of two
science rovers. As we can see from the code, Spirit is responsible for gathering
information about minerals, followed by a visit to funny_rock and a return to relay.
In parallel, Opportunity must travel to the distant alien_lair, perform its science,
and then head to the relay location. The tc_relay constraint requires Opportunity
to start sending its data no more than 10 seconds after Spirit has finished
transmitting, and there is an overall temporal constraint [1800, 2000] representing
the 200 second window during which the satellite will be visible. A risk bound of 1%
is placed on the violation of either of these two constraints, given the uncertainty
associated with the temporal durations in this plan. In Figure 7.4, we see a pTPN
representation of the cRMPL program in Figure 7.3.

The region-to-region traversal episodes shown in Figure 7.4 are each broken
by pSulu into 10 intermediate segments, each one of them featuring a stochastic
duration represented by a Gaussian random variable. We used Picard (Chapter 6)
to generate a risk-bounded, strongly controllable (precomputed) schedule in 11.7
seconds, with the additional optimization that the schedule commands the rovers
to initiate their activities as early as possible. This example shows how easily
realistic temporal coordination missions with environmental uncertainty and risk
bounds can be modeled in cRMPL, and how efficiently Picard is able to leverage
acceptable risk levels to compute schedules that are easy to verify, are robust to
uncertainty, and offer hard guarantees that all mission requirements will be met
with high probability (at least 99%, in this example).

Figure 7.1: The Mars rover scenario featuring two robotic scouts (Spirit and
Opportunity) that must explore different regions of the map and coordinate their
communication with an orbiting satellite at a relay station. It is implemented in
the MobileSim simulator.

7.1.1  Learning  uncertain  temporal  duration  models  from  data 

The  previously  presented  demonstration  assumes  knowledge  about  the  temporal 
uncertainty  associated  with  traversals,  such  as  the  one  shown  in  Figure  7.5.  There 
is  the  question,  nevertheless,  of  how  one  can  learn  these  traversal  distributions 
from  data,  so  that  they  can  be  incorporated  into  cRMPL  models. 
class Rover(object):
    """Simple RMPyL model for a Mars rover."""

    def __init__(self, name):
        self.name = name
        self.path_planner = PySuluRMPyL()

    def go_to(self, start, goal, risk, waypoints=10, time_horizon=200.0):
        """Returns the episode corresponding to the vehicle traveling."""
        self.rover_param['chance_constraint'] = risk
        self.rover_param['waypoints'] = waypoints
        self.rover_param['time_horizon'] = time_horizon
        goto_ep = self.path_planner.plan_episode(start_state=start + (0.0, 0.0),
                                                 goal_state=goal + (0.0, 0.0),
                                                 parameters=self.rover_param,
                                                 duration_type='gaussian',
                                                 agent=self.name)
        return goto_ep

    def perform_science(self):
        """Returns the episode corresponding to the vehicle performing science
        experiments."""
        ps_ep = sequence_composition(
            Episode(duration={'ctype': 'uncontrollable_bounded', 'lb': 9, 'ub': 11},
                    action='(drill %s)' % (self.name)),
            Episode(duration={'ctype': 'uncontrollable_bounded', 'lb': 10, 'ub': 15},
                    action='(collect %s)' % (self.name)),
            Episode(duration={'ctype': 'controllable', 'lb': 5, 'ub': 30},
                    action='(process %s)' % (self.name)))
        return ps_ep

    def relay(self):
        """Returns the episode representing the rover sending data back to a
        satellite."""
        rel_ep = Episode(duration={'ctype': 'controllable', 'lb': 5, 'ub': 30},
                         action='(relay %s)' % (self.name))
        return rel_ep


Figure 7.2: cRMPL class modeling the actions of a science retrieval rover.
loc = {'start': (8.751, -8.625), 'minerals': (0.0, -10.0),
       'funny_rock': (-5.0, -2.0), 'relay': (0.0, 0.0), 'alien_lair': (0.0, 10.0)}

rov1 = Rover(name='spirit')
rov2 = Rover(name='opportunity')

prog = RMPyL(name='run()')
prog *= prog.parallel(
    prog.sequence(
        rov1.go_to(start=loc['start'], goal=loc['minerals'], risk=0.01),
        rov1.perform_science(),
        rov1.go_to(start=loc['minerals'], goal=loc['funny_rock'], risk=0.01),
        rov1.perform_science(),
        rov1.go_to(start=loc['funny_rock'], goal=loc['relay'], risk=0.01),
        rov1.relay(ep_id=rov1.name + '_relay')),
    prog.sequence(
        rov2.go_to(start=loc['start'], goal=loc['alien_lair'], risk=0.01),
        rov2.perform_science(),
        rov2.go_to(start=loc['alien_lair'], goal=loc['relay'], risk=0.01),
        rov2.relay(ep_id=rov2.name + '_relay')))

r1_rel = prog.episode_by_id(rov1.name + '_relay')
r2_rel = prog.episode_by_id(rov2.name + '_relay')

tc_relay = TemporalConstraint(start=r1_rel.end, end=r2_rel.start,
                              ctype='controllable', lb=0.0, ub=10.0)
prog.add_temporal_constraint(tc_relay)

tc = prog.add_overall_temporal_constraint(ctype='controllable', lb=1800.0, ub=2000.0)
cc_time = ChanceConstraint(constraint_scope=[tc, tc_relay], risk=0.1)
prog.add_chance_constraint(cc_time)


Figure 7.3: Program in cRMPL describing the temporal coordination between two
science-retrieval agents.




Figure  7.4:  pTPN  representation  of  the  control  script  in  Figure  7.3. 
Figure 7.5: Example of a traversal generated by pSulu, a chance-constrained path
planner.


In this demonstration, we made the assumption that traversal times between
waypoints, such as the ones shown in Figure 7.5, can be accurately predicted as a
simple function of the Euclidean distance between them (the length of the straight
line connecting the two regions in space). To be more precise, we assumed the linear
model $d(l) = al + b$, where $l$ is the length of the line connecting the two
waypoints; $d$ is the duration of the traversal; and $a$ and $b$ are unknown
parameters. In order to estimate $a$ and $b$, we simulated hundreds of different
traversals in the environment shown in Figure 7.1 and measured the time it took our
path-following controller to drive a rover between those locations. For the $i$-th
traversal performed out of a total of $N$, we recorded the pair $(l_i, d_i)$. Then,
we chose $\hat{a}, \hat{b}$, our best estimates of the parameters $a$ and $b$,
according to

$$\hat{a}, \hat{b} = \arg\min_{a,b} \sum_{i=1}^{N} (a l_i + b - d_i)^2, \qquad (7.1)$$

i.e., we chose $\hat{a}$ and $\hat{b}$ so as to minimize the sum of squared errors
of the predictor $d(l) = \hat{a} l + \hat{b}$ over the recorded traversals.
In order to estimate the variance of the prediction $d(l)$ for an arbitrary length
$l$, we compute the empirical variance of the samples $(l_i, d_i)$ in a neighborhood
around $l$. Figure 7.6 shows one such model learned with (7.1) from simulated data,
where the upper and lower $3\sigma$ bounds are computed based on the worst-case
empirical variance encountered on the dataset. Note that the exact same
procedure could be used if real traversal data were available, therefore allowing
(7.1) to be used with both real and synthetic data combined.
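
The fitting procedure in (7.1) is ordinary least squares and can be sketched in plain Python as follows. The function names, the neighborhood radius, and the sample data are hypothetical; in particular, measuring the local variance on the residuals about the fitted line is one possible reading of "empirical variance in a neighborhood around l", adopted here as an assumption.

```python
def fit_line(lengths, durations):
    """Closed-form least-squares estimates of a and b in d(l) = a*l + b."""
    n = len(lengths)
    mean_l = sum(lengths) / n
    mean_d = sum(durations) / n
    s_ll = sum((l - mean_l) ** 2 for l in lengths)
    s_ld = sum((l - mean_l) * (d - mean_d) for l, d in zip(lengths, durations))
    a_hat = s_ld / s_ll
    b_hat = mean_d - a_hat * mean_l
    return a_hat, b_hat


def local_variance(lengths, durations, a_hat, b_hat, l, radius):
    """Sample variance of the residuals of traversals with |l_i - l| <= radius."""
    residuals = [d - (a_hat * li + b_hat)
                 for li, d in zip(lengths, durations) if abs(li - l) <= radius]
    if len(residuals) < 2:
        return None  # not enough nearby samples to estimate a variance
    mean_r = sum(residuals) / len(residuals)
    return sum((r - mean_r) ** 2 for r in residuals) / (len(residuals) - 1)


# Hypothetical traversal data: (length, measured duration) pairs.
lengths = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
durations = [2.8, 3.2, 4.9, 5.1, 7.3, 6.7]

a_hat, b_hat = fit_line(lengths, durations)
var_3 = local_variance(lengths, durations, a_hat, b_hat, l=3.0, radius=0.5)
```

The mean predicted by the fitted line, together with a variance estimate of this kind, gives the parameters of the Gaussian duration attached to a traversal of a given length.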


Figure 7.6: Stochastic traversal time model learned from data. The horizontal
axis represents the Euclidean distance between waypoints, while the vertical axis
represents the traversal time. The upper and lower $3\sigma$ bounds are computed
based on the worst-case empirical variance encountered on the dataset.


7.2  Baxter  cooperative  manufacturing 

In this integrated demo, we show how a non-expert user can interact with the
Baxter robot in a collaborative task where the robot is constantly adapting to its
human coworker, therefore requiring plans with embedded contingencies. In the proof
of concept, the red and green blocks must be moved to their appropriate locations,
as shown in Figure 7.7. At every point in time, the robot observes the current
state of the manufacturing task and acts accordingly.

Figure 7.7: The Baxter workspace setup.

The videos at http://people.csail.mit.edu/psantana/public/videos/AFOSR/ show
two situations:

•  in  the  nominal  run,  the  human  does  not  interfere  with  the  robot,  which  pro¬ 
ceeds  to  pick  and  place  the  blocks  in  sequence  according  to  the  execution 
policy; 

•  in  the  second  run,  the  human  helps  the  robot  and  places  the  green  block  at 
its  destination.  The  robot  senses  this,  and  looks  at  the  policy  derived  for  the 
sensed  system  state,  resulting  in  a  pick  and  place  of  the  red  block. 

The  program  implementing  such  behavior  is  shown  in  Figure  7.8,  along  with 
its  pTPN  representation  in  Figure  7.9. 

The user programs the specifications, and a temporally consistent,
chance-constrained execution policy is generated by the system. This policy is given
as an input to Pike [10], Enterprise's executive. Further, the demonstration shows
that the system executes in real time. The derived policy with timing constraints
is given in Figure 7.9. At every stage, the policy maps from the state of the world
to the action the robot should perform in the next time step, including querying
for further observations.

def say(text):
    """Final message to the human."""
    return Episode(action=('(say \"%s\")' % text))

def pick_and_place_block(prog, block, pick_loc, place_loc, manip, agent):
    """Picks a block and places it somewhere."""
    obj = block + 'Component'
    return prog.sequence(
        say('Going to pick %s' % obj),
        Episode(action=('(pick %s %s %s %s)' % (obj, manip, pick_loc, agent))),
        Episode(action=('(place %s %s %s %s)' % (obj, manip, place_loc, agent))))

def observe_and_act(prog, blocks, manip, agent):
    """Robot observes human and acts accordingly."""
    if len(blocks) > 0:
        # Human helped with one of the blocks
        human_help = [observe_and_act(prog,
                                      [ob for ob in blocks if ob != b],
                                      manip, agent) for b in blocks]
        # No help from the human
        no_human_help = prog.sequence(
            pick_and_place_block(prog, blocks[0],
                                 blocks[0] + 'Bin',
                                 blocks[0] + 'Target',
                                 manip, agent),
            observe_and_act(prog, blocks[1:], manip, agent))
        # All episodes
        all_episodes = human_help + [no_human_help]
        # Observe each one of the blocks
        observations = blocks + ['none']
        return prog.sequence(
            prog.observe({'name': 'observe-human-%d' % (len(blocks)),
                          'ctype': 'uncontrollable',
                          'domain': observations},
                         *all_episodes))
    else:
        return say('All done!')

###### Control program starts here
blocks = ['Red', 'Green']
agent = 'Baxter'
manip = 'BaxterRight'

prog = RMPyL(name='run()')
prog *= prog.sequence(say('Should I start?'),
                      prog.observe({'name': 'observe-human-%d' % (len(blocks)),
                                    'ctype': 'uncontrollable',
                                    'domain': ['YES', 'NO']},
                                   observe_and_act(prog, blocks, manip, agent),
                                   say('All done!')))


Figure 7.8: Complete cRMPL program used to implement the collaborative
manufacturing demonstration.

An  intuition  for  understanding  the  policy  would  be  to  consider  it  as  a  sequence 
of  IF  statements.  However,  the  policy  also  allows  reasoning  over  timing  con¬ 
straints,  otherwise  difficult  to  capture  with  conventional  control  structures  such  as 
IF  statements.  Further,  as  shown  by  visual  inspection,  the  actual  policy  for  the 
problem  is  too  tedious  to  actually  encode  by  hand,  even  for  expert  users.  Using 
our  system,  the  user  was  able  to  describe  the  desired  behavior  in  this  case  with 
simple  statements  specifying  the  end  location  of  the  blocks,  and  the  risk-aware 
system  automatically  generated  the  policy.  This  allows  ease  of  use  by  non-experts, 
one  of  the  desired  features  of  a  deployable  risk-aware  autonomous  planning  and 
scheduling  system. 


7.2.1  Hybrid  model  learning  in  support  of  plan  execution 

When executing plans on real hardware, as in the manufacturing demonstration
presented in this section, one no longer has full knowledge about the state of the
environment, as is usually the case in simulations. Instead, one must resort to
different types of sensors, along with models of the environment, in order to be
able to generate more human-understandable predicates such as "green block at
its location" or "human has the red block in their hand", which are used during
the planning and dispatching phases. In order to address this need, our group has
developed LCARS, a predicate estimator based on Qualitative Spatial Reasoning
(QSR) [43] and Probabilistic Hybrid Automata (PHA) [44,45] models of the
environment.

Similar to Section 7.1.1, one should ask how these models can be acquired in
order to perform state estimation of the surrounding environment and its agents.
Towards that effort, and in the context of this project, we have proposed in [8]
the first data-driven algorithm capable of learning PHA models directly from
experimental data. The derivation of the algorithm is rather
involved, but we show in our work that such PHA models can greatly enhance the
performance of state estimators of complex systems, such as maneuvering aircraft
and engineered systems with switched dynamics. We are currently in the process
of applying the same techniques to LCARS, so that it can learn QSR models
directly from experimental data, with little to no human supervision.
7.3 Automatic risk-aware program generation with RAO*

In  this  section,  we  show  how  RAO*  (Chapter  5)  is  able  to  automatically  gener¬ 
ate  cRMPL  programs  implementing  safe,  optimal  behavior  from  a  CC-POMDP 
model  description.  We  demonstrate  it  in  the  aerial  scout  scenario  depicted  in 
Figures  7.10  and  7.11  featuring  a  Cessna  172P  aircraft  in  the  open-source  flight 
simulator  FlightGear. 

The  blue  regions  in  Figure  7.10  represent  areas  of  interest  that  might  con¬ 
tain  targets  of  various  levels  of  importance.  The  confidence  about  the  presence 
or  absence  of  these  targets  in  each  one  of  the  areas  is  given  by  a  prior  probabil¬ 
ity  distribution,  which  is  part  of  the  model  given  to  RAO*.  The  gray  regions  in 
Figure  7.10  are  no-fly  zones  that  the  aircraft  should  avoid,  and  we  use  chance 
constraints  to  bound  the  probability  of  the  aircraft  inadvertently  entering  any  one 
of  them.  When  the  aircraft  flies  over  a  region,  it  will  discover  a  target  with  high 
probability  should  one  be  present,  in  which  case  it  gains  information  reward.  The 
risk-bounded  path  planning  problem  in  this  demonstration  is  solved  by  pSulu,  the 
aforementioned  chance-constrained  path  planner  developed  in  our  group. 
Figure 7.10: Mission map on AutoFG, the autopilot used to fly the Cessna in
FlightGear. Blue regions are sites of interest that should be visited, time and risk
permitting. Gray areas are no-fly zones that the aircraft should avoid.




Figure  7.11:  Screenshot  of  the  FlightGear  flight  simulator,  featuring  a  Cessna 
172P  aircraft. 


Following the Enterprise architecture described in Chapter 2, cRMPL programs
are compiled into a Probabilistic Temporal Plan Network (pTPN) format and
dispatched by Pike. For illustration purposes, Figure 7.12 shows a pTPN
corresponding to an over-constrained situation where the aircraft only has enough
resources to visit a single site, while Figure 7.13 shows a snapshot of a cRMPL
program generated by RAO* being dispatched by Pike within the Enterprise
architecture. The full video can be found at
http://people.csail.mit.edu/psantana/public/videos/AFOSR/.


Figure 7.12: pTPN representation of a cRMPL program with enough resources
to visit a single site.

Figure 7.13: Snapshot of a cRMPL program generated by RAO* being dispatched
by Pike within the Enterprise architecture.


A  more  detailed  analysis  of  RAO*’s  computational  report  can  be  found  in 
Appendix  A. 
Chapter 8
Conclusions 


This project developed Enterprise, a model-based programming and execution architecture that allows safe autonomous behavior to be specified at a natural level of abstraction, and contingent plans that are robust to environmental uncertainty to be synthesized, verified, and dispatched in real time, while ensuring good performance and hard guarantees on the risk of failure.

The  chapters  in  this  report  and  referenced  publications  describe  our  efforts 
in  developing  formal  methods  to  accomplish  the  goals  set  forth  in  the  desiderata, 
and  our  demonstrations,  both  in  simulations  and  in  real  hardware,  confirm  that  our 
chance-constrained  architecture  meets  the  desired  goals  of  the  project.  All  goals 
in  our  original  Statement  of  Objectives  were  met  according  to  plan  or  extended, 
and  are  briefly  summarized  below  for  convenience: 

Objective  1  :  create  a  chance-constrained,  reactive  model-based  programming 
language  (cRMPL)  for  specifying  mission  goals  and  risk  levels. 

Objective 2 : develop a risk-sensitive executive for cRMPL that uses its action
model to plan an execution that maximizes expected utility, while respecting
all chance constraints.

Objective  3  :  validate  cRMPL  and  Enterprise  on  a  UAV  mission  (flight  simulator 
or  indoor  quadcopters),  spacecraft  mission  (MIT  Spheres  spacecraft)  or  a 
robot  logistics  mission. 

The cRMPL language described in Chapter 3 and empirically demonstrated in Chapter 7 allows human operators to specify safe desired behavior at a high level of abstraction. It extends the previous version of RMPL by adding support for state and temporal uncertainty, as well as safety guarantees in the form of chance constraints, while retaining RMPL’s original capability of representing complex hierarchical compositions of plan episodes and their parallel coordination. Moreover, the choice of implementing cRMPL as a module of the general-purpose Python language allows it to be integrated within modern robotic frameworks, such as the Robot Operating System (ROS). This integration was a key feature that enabled cRMPL to control the Baxter robot in our robotic manufacturing demonstration in Chapter 7.

Chapters 4 through 6 and their accompanying peer-reviewed publications describe the different components required to accomplish the second objective. The formal methods and algorithms developed provide the user of Enterprise with a wide range of tools for checking contingent plans against safety specifications; generating optimal and safe contingent plans from a model description of the agent and its environment; and producing activity schedules that are robust not only to uncertainties in the timing of different actions, but also to the need for coordination among the different autonomous agents under control. Our demonstrations in Chapter 7 confirm that these pieces, when integrated within the Enterprise framework, allow autonomous agents to exhibit safe and optimal behavior while operating in partially known environments. Moreover, we provide references and intuitions related to our work on learning task and environmental models from experimental and simulated data, which was carried out to enable Enterprise to be demonstrated in real-world settings.

Concerning the experimental validation in the third objective, we demonstrated Enterprise both in simulated UAV missions (FlightGear) and on real quadcopters in the undergraduate course described in Chapter 2. We have also shown its usefulness in a robotic manufacturing test bed purchased under this contract. The Enterprise architecture is essentially unchanged among all sections in Chapter 7, as well as Chapter 2: the planner and scheduler produce an execution policy, which is dispatched in real time by Pike. This architecture is thus transferable and shown to be easy to use by non-experts, having been demonstrated on vastly different hardware and software.
Appendix A

RAO*:  an  Algorithm  for 
Chance-Constrained  POMDP’s 


This appendix contains a detailed description and experimental evaluation of Risk-aware AO* (RAO*), which was introduced in Chapter 5 in the context of model-based generation of cRMPL programs. It is currently under review for publication at the 30th Conference on Artificial Intelligence (AAAI16).

Abstract

Autonomous agents operating in partially observable stochastic environments often face the problem of optimizing expected performance while bounding the risk of violating safety constraints. Such problems can be modeled as chance-constrained POMDP’s (CC-POMDP’s). Our first contribution is a systematic derivation of execution risk in POMDP domains, which improves upon how chance constraints are handled in the constrained POMDP literature. Second, we present RAO*, a heuristic forward search algorithm producing optimal, deterministic, finite-horizon policies for CC-POMDP’s. In addition to the utility heuristic, RAO* leverages an admissible execution risk heuristic to quickly detect and prune overly risky policy branches. Third, we demonstrate the usefulness of RAO* in two challenging domains of practical interest: power supply restoration and autonomous science agents.

1  Introduction 

Partially  Observable  Markov  Decision  Processes  (POMDPs) 
(Smallwood  and  Sondik  1973)  have  become  one  of  the  most 
popular  frameworks  for  optimal  planning  under  actuator 
and  sensor  uncertainty,  where  POMDP  solvers  find  policies 
that  maximize  some  measure  of  expected  utility  (Kaelbling, 
Littman,  and  Cassandra  1998;  Silver  and  Veness  2010). 

In many application domains, however, performance is not enough. Critical missions in real-world scenarios require agents to develop a keen sensitivity to risk, which needs to be traded off against utility. For instance, a search and rescue UAV should maximize the value of the information gathered, subject to safety constraints such as avoiding dangerous areas and keeping sufficient battery levels. In these domains, autonomous agents should seek to optimize expected reward while remaining safe by deliberately keeping the probability of violating one or more constraints within acceptable levels. A bound on the probability of violating constraints is called a chance constraint (Birge and Louveaux 1997). Unsurprisingly, attempting to model chance constraints as negative rewards leads to models that are oversensitive to the particular penalty value chosen, and to policies that are overly risk-averse or overly risk-taking (Undurti and How 2010). Therefore, to accommodate the type of scenarios exemplified above, new models and algorithms for constrained MDPs have started to emerge, which handle chance constraints explicitly.

Research has mostly focused on fully observable constrained MDPs, for which non-trivial theoretical properties are known (Altman 1999; Feinberg and Shwarz 1995). Existing algorithms cover an interesting spectrum of chance constraints over secondary objectives or even execution paths, e.g., (Dolgov and Durfee 2005; Hou, Yeoh, and Varakantham 2014; Teichteil-Konigsbuch 2012). For constrained POMDPs (C-POMDP’s), the state of the art is less mature. It includes a few suboptimal or approximate methods based on extensions of dynamic programming (Isom, Meyn, and Braatz 2008), point-based value iteration (Kim et al. 2011), approximate linear programming (Poupart et al. 2015), or on-line search (Undurti and How 2010). Moreover, as we later show, the modeling of chance constraints through unit costs in the C-POMDP literature has a number of shortcomings.

Our first contribution is a systematic derivation of the execution risk in POMDP domains, and how it can be used to enforce different types of chance constraints. Second, we present Risk-bounded AO* (RAO*), a new algorithm for solving chance-constrained POMDPs (CC-POMDPs) that harnesses the power of heuristic forward search in belief space (Washington 1996; Bonet and Geffner 2000; Szer, Charpillet, and Zilberstein 2005; Bonet and Geffner 2009). Similar to AO* (Nilsson 1982), RAO* guides the search towards promising policies w.r.t. reward using an admissible heuristic. Third, RAO* leverages a second admissible heuristic to derive and propagate execution risk upper bounds at each search node, allowing it to identify and prune overly risky paths as the search proceeds. Last, we demonstrate the usefulness of RAO* in two risk-sensitive domains of practical interest: automated power supply restoration (PSR) and autonomous science agents (SA).

RAO* returns policies that maximize the expected cumulative reward among the set of deterministic, finite-horizon policies satisfying the chance constraints. Even though optimal policies for CC-(PO)MDPs may, in general, require some limited amount of randomization (Altman 1999), we follow Dolgov and Durfee (2005) in deliberately developing an approach restricted to deterministic policies. This is motivated by the fact that users rarely trust stochastic decisions when dealing with safety-critical applications.

The  paper  is  organized  as  follows.  Section  2  formulates 
the  type  of  CC-POMDPs  we  consider,  and  details  how 
RAO*  computes  execution  risks  and  propagates  risk  bounds 
forward.  Next,  Section  3  discusses  shortcomings  related  to 
the  treatment  of  chance  constraints  in  the  C-POMDP  liter¬ 
ature.  Section  4  presents  the  RAO*  algorithm,  followed  by 
our  experiments  in  Section  5,  and  conclusions  in  Section  6. 


2  Problem  formulation 


When the true state of the system is hidden, one can only maintain a probability distribution (a.k.a. belief state) over the possible states of the system at any given point in time. Many applications in which an agent is trying to act under uncertainty while optimizing some measure of performance can be adequately framed as instances of Partially Observable Markov Decision Processes (POMDP) (Smallwood and Sondik 1973). Here, we focus on the case where there is a finite policy execution horizon $h$, after which the system performs a deterministic transition to an absorbing state.

Definition 1 (Finite-horizon POMDP). A FH-POMDP is a tuple $H = \langle S, A, \mathcal{O}, T, O, R, b_0, h \rangle$, where $S$ is a set of states; $A$ is a set of actions; $\mathcal{O}$ is a set of observations; $T : S \times A \times S \to \mathbb{R}$ is a stochastic state transition function; $O : S \times \mathcal{O} \to \mathbb{R}$ is a stochastic observation function; $R : S \times A \to \mathbb{R}$ is a reward function; $b_0$ is the initial belief state; and $h$ is the execution horizon.

A solution to an FH-POMDP is a mapping $\pi : B \to A$ from beliefs to actions, called a policy. An optimal policy $\pi^*$ is such that

$$\pi^* = \arg\max_{\pi} E\left[ \sum_{t=0}^{h} R(s_t, a_t) \,\middle|\, \pi \right]. \qquad (1)$$

In this work, we focus on the particular case of discrete sets of states, actions, and observations, and deterministic optimal policies.

Let $b_k : S \to [0, 1]$ denote the posterior belief state at the $k$-th time step. A belief state at time $k+1$ that only incorporates information about the most recent action $a_k$ is called a prior belief state and denoted by $b(s_{k+1} \mid a_k)$. If, besides $a_k$, the belief state also incorporates knowledge from the most recent observation $o_{k+1}$, we call it a posterior belief state and denote it by $b(s_{k+1} \mid a_k, o_{k+1})$. These beliefs can be recursively computed as follows:

$$b(s_{k+1} \mid a_k) = \Pr(s_{k+1} \mid b_k, a_k) = \sum_{s_k} T(s_k, a_k, s_{k+1})\, b(s_k), \qquad (2)$$

$$b(s_{k+1} \mid a_k, o_{k+1}) = \Pr(s_{k+1} \mid b_k, a_k, o_{k+1}) = \frac{1}{\eta}\, O(s_{k+1}, o_{k+1})\, b(s_{k+1} \mid a_k), \qquad (3)$$

where $T$ and $O$ are from Definition 1, and

$$\eta = \Pr(o_{k+1} \mid a_k, b_k) = \sum_{s_{k+1}} O(s_{k+1}, o_{k+1})\, b(s_{k+1} \mid a_k) \qquad (4)$$

is the probability of collecting some observation $o_{k+1}$ after executing action $a_k$ at a belief state $b_k$.
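The prediction and update steps in (2), (3), and (4) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: beliefs and models are stored in plain dictionaries, and the two-state model (`ok`, `bad`), the action `go`, and the observations `clear` and `alarm` are hypothetical.

```python
from typing import Dict, Tuple

Belief = Dict[str, float]
Transition = Dict[Tuple[str, str, str], float]   # (s, a, s') -> probability
Observation = Dict[Tuple[str, str], float]       # (s', o) -> probability


def prior_belief(b: Belief, a: str, T: Transition) -> Belief:
    """Equation (2): push the current belief through the transition model."""
    prior: Belief = {}
    for (s, act, s_next), p in T.items():
        if act == a and s in b:
            prior[s_next] = prior.get(s_next, 0.0) + p * b[s]
    return prior


def observation_probability(prior: Belief, o: str, O: Observation) -> float:
    """Equation (4): normalization constant eta = Pr(o | a, b)."""
    return sum(O.get((s, o), 0.0) * p for s, p in prior.items())


def posterior_belief(prior: Belief, o: str, O: Observation) -> Belief:
    """Equation (3): condition the prior belief on the observation o."""
    eta = observation_probability(prior, o, O)
    if eta == 0.0:
        raise ValueError("observation has zero probability under this prior")
    return {s: O.get((s, o), 0.0) * p / eta for s, p in prior.items()}


# Hypothetical toy model, used only for illustration.
T = {("ok", "go", "ok"): 0.9, ("ok", "go", "bad"): 0.1, ("bad", "go", "bad"): 1.0}
O = {("ok", "clear"): 0.8, ("ok", "alarm"): 0.2,
     ("bad", "clear"): 0.3, ("bad", "alarm"): 0.7}
b0 = {"ok": 1.0}

prior = prior_belief(b0, "go", T)                 # mass 0.9 on ok, 0.1 on bad
eta = observation_probability(prior, "alarm", O)  # 0.2*0.9 + 0.7*0.1
posterior = posterior_belief(prior, "alarm", O)
```

In this toy model, observing `alarm` after `go` shifts the belief mass on `bad` from 0.1 in the prior to 0.28 in the posterior.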

In addition to optimizing performance, the next section shows how one can enforce safety in FH-POMDPs by means of chance constraints.


2.1  Computing  risk 

A chance constraint consists of a bound $\Delta$ on the probability (chance) of some event happening during policy execution. Following (Ono, Kuwata, and Balaram 2012), we define this event as a sequence of states $s_{0:h} = s_0, s_1, \ldots, s_h$ of a FH-POMDP $H$ violating one or more constraints in a set $C$. Let $p$ (for “path”) denote a sequence of states $p = s_{0:h}$, and let $c_v(p) \in \{0, 1\}$ be an indicator function such that $c_v(p) = 1$ iff one or more states in $p$ violate constraints in $C$. The latter implies that states encompass all the information required to evaluate constraints. With this notation, we can write the chance constraint as

$$E_p\big(c_v(p) \mid H, \pi\big) \le \Delta. \qquad (5)$$

One should notice that we make no assumptions about constraint violations producing observable outcomes, such as causing execution to halt.

Our approach for incorporating chance constraints into FH-POMDP’s extends that of (Ono and Williams 2008; Ono, Kuwata, and Balaram 2012) to partially observable planning domains. We would like to be able to compute increasingly better approximations of (5) with admissibility guarantees, so as to be able to quickly detect policies that are guaranteed to violate (5). For that purpose, let $b_k$ be a belief state over $s_k$, and $Sa_k$ be a Bernoulli random variable denoting whether the system has not violated any constraints (is “safe”) at time $k$. We define

$$er(b_k, C \mid \pi) = 1 - \Pr\left( \bigwedge_{i=k}^{h} Sa_i \,\middle|\, b_k, \pi \right) \qquad (6)$$

as the execution risk of policy $\pi$, as measured from $b_k$. The probability term in (6) can be written as

$$\Pr\left( \bigwedge_{i=k}^{h} Sa_i \,\middle|\, b_k, \pi \right) = \Pr\left( \bigwedge_{i=k+1}^{h} Sa_i \,\middle|\, Sa_k, b_k, \pi \right) \Pr(Sa_k \mid b_k, \pi), \qquad (7)$$

where $\Pr(Sa_k \mid b_k, \pi)$ is the probability of the system not being in a constraint-violating path at the $k$-th time step. Since $b_k$ is given, $\Pr(Sa_k \mid b_k, \pi)$ can be computed as

$$\Pr(Sa_k \mid b_k, \pi) = 1 - \sum_{s_k} b(s_k)\, c_v(s_k, C) = 1 - r_b(b_k, C), \qquad (8)$$

where $r_b(b_k, C)$ is called the risk at the $k$-th step. Note that $c_v(s_k, C) = 1$ iff $s_k$ or any of its ancestor states violate constraints in $C$. In situations where the particular set of constraints $C$ is not important, we will use the shorthand notation $r_b(b_k)$ and $er(b_k \mid \pi)$. The second probability term in (7) can be written as

$$\Pr\left( \bigwedge_{i=k+1}^{h} Sa_i \,\middle|\, Sa_k, b_k, \pi \right) = \sum_{b_{k+1}} \Pr\left( \bigwedge_{i=k+1}^{h} Sa_i \,\middle|\, b_{k+1}, \pi \right) \Pr(b_{k+1} \mid Sa_k, b_k, \pi) = \sum_{b_{k+1}} \big(1 - er(b_{k+1} \mid \pi)\big) \Pr(b_{k+1} \mid Sa_k, b_k, \pi). \qquad (9)$$
The summation in (9) is over belief states at time $k+1$. These, in turn, are determined by (3), with $a_k = \pi(b_k)$ and some corresponding observation $o_{k+1}$. Therefore, we have $\Pr(b_{k+1} \mid Sa_k, b_k, \pi) = \Pr(o_{k+1} \mid Sa_k, \pi(b_k), b_k)$. For the purpose of computing the RHS of the last equation, it is useful to define the safe prior belief as

$$b^{sa}(s_{k+1} \mid a_k) = \Pr(s_{k+1} \mid Sa_k, a_k, b_k) = \frac{\sum_{s_k : c_v(s_k, C) = 0} T(s_k, a_k, s_{k+1})\, b(s_k)}{1 - r_b(b_k)}. \qquad (10)$$

With (10), we can define

$$\mathrm{Pr}^{sa}(o_{k+1} \mid a_k, b_k) = \Pr(o_{k+1} \mid Sa_k, a_k, b_k) = \sum_{s_{k+1}} O(s_{k+1}, o_{k+1})\, b^{sa}(s_{k+1} \mid a_k), \qquad (11)$$

which is the distribution over observations at time $k+1$, assuming that the system was in a non-violating path at time $k$. Combining (6), (8), (9), and (11), we get the recursion

$$er(b_k \mid \pi) = r_b(b_k) + \big(1 - r_b(b_k)\big) \sum_{o_{k+1}} \mathrm{Pr}^{sa}(o_{k+1} \mid \pi(b_k), b_k)\, er(b_{k+1} \mid \pi), \qquad (12)$$

which is key to RAO*. If $b_k$ is terminal, (12) simplifies to $er(b_k \mid \pi) = r_b(b_k)$. Note that (12) uses (11), rather than (4), to compute execution risk. We can now use the execution risk to express the chance constraint (5) in our definition of a chance-constrained POMDP (CC-POMDP).

Definition 2 (Chance-constrained POMDP). A CC-POMDP is a tuple $\langle H, C, \Delta \rangle$, where $H$ is a FH-POMDP; $C$ is a set of constraints defined over $S$; and $\Delta = [\Delta^1, \ldots, \Delta^q]$ is a vector of probabilities for $q$ chance constraints

$$er(b_0, C^i \mid \pi) \le \Delta^i, \quad C^i \in 2^{C}, \quad i = 1, 2, \ldots, q. \qquad (13)$$

The chance constraint in (13) bounds the probability of constraint violation over the whole policy execution. Alternative forms of chance constraints entailing safer behavior are discussed in Section 2.3. Our approach for finding optimal, deterministic, chance-constrained solutions for CC-POMDP’s in this setting is explained in Section 4.

2.2  Propagating risk bounds forward

The approach for computing risk in (12) does so “backwards”, i.e., risk propagation happens from terminal states to the root of the search tree. However, since we propose computing policies for chance-constrained POMDP’s by means of heuristic forward search, one should seek to propagate the risk bound in (13) forward so as to be able to quickly detect that the current best policy is too risky.

Let $0 \le \Delta_k \le 1$ be the bound on execution risk for node $b_k$. From (12), we get

$$r_b(b_k) + \big(1 - r_b(b_k)\big) \sum_{o_{k+1}} \mathrm{Pr}^{sa}(o_{k+1} \mid \pi(b_k), b_k)\, er(b_{k+1} \mid \pi) \le \Delta_k. \qquad (14)$$

Let $o'_{k+1}$ such that $\mathrm{Pr}^{sa}(o'_{k+1} \mid \pi(b_k), b_k) \ne 0$ be the observation associated with the child $b'_{k+1}$ of $b_k$. From (14), we get

$$er(b'_{k+1} \mid \pi) \le \frac{1}{\mathrm{Pr}^{sa}(o'_{k+1} \mid \pi(b_k), b_k)} \left( \frac{\Delta_k - r_b(b_k)}{1 - r_b(b_k)} - \sum_{o_{k+1} \ne o'_{k+1}} \mathrm{Pr}^{sa}(o_{k+1} \mid \pi(b_k), b_k)\, er(b_{k+1} \mid \pi) \right). \qquad (15)$$

The existence of (15) requires $r_b(b_k) < 1$ and $\mathrm{Pr}^{sa}(o'_{k+1} \mid \pi(b_k), b_k) \ne 0$ whenever $\Pr(o'_{k+1} \mid \pi(b_k), b_k) \ne 0$. Lemma 1 shows that these conditions are equivalent.

Lemma 1. One observes $\mathrm{Pr}^{sa}(o_{k+1} \mid \pi(b_k), b_k) = 0$ and $\Pr(o_{k+1} \mid \pi(b_k), b_k) \ne 0$ if, and only if, $r_b(b_k) = 1$.


Proof. $\Leftarrow$: if $r_b(b_k) = 1$, we conclude from (8) that $c_v(s_k, C) = 1, \forall s_k$. Hence, all elements in (10) and, consequently, (11) will have probability 0.

$\Rightarrow$: from Bayes’ rule, we have

$$\Pr(Sa_k \mid o_{k+1}, a_k, b_k) = \frac{\mathrm{Pr}^{sa}(o_{k+1} \mid a_k, b_k)\, \big(1 - r_b(b_k)\big)}{\Pr(o_{k+1} \mid a_k, b_k)}.$$

Since the numerator is 0 and the denominator is nonzero by assumption, the right-hand side is 0. Hence, we conclude that $\Pr(\neg Sa_k \mid o_{k+1}, a_k, b_k) = 1$, i.e., the system is guaranteed to be in a constraint-violating path at time $k$, yielding $r_b(b_k) = 1$. □

The execution risk of nodes whose parents have $r_b(b_k) = 1$ is irrelevant, as shown by (12). Therefore, it only makes sense to propagate risk bounds in cases where $r_b(b_k) < 1$.
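The backward recursion (12), together with the step risk (8), the safe prior belief (10), and the safe observation distribution (11), can be sketched as follows. This is a hedged illustration and not the authors' code. It makes two assumptions that should be kept in mind: the toy model (states `ok` and `bad`, action `go`, observations `clear` and `alarm`) is hypothetical, and each child belief is taken to be the safe prior conditioned on the observation, so that child risks are measured under the assumption that the system was safe at step $k$.

```python
from typing import Callable, Dict, List, Set, Tuple

Belief = Dict[str, float]
Transition = Dict[Tuple[str, str, str], float]   # (s, a, s') -> probability
Observation = Dict[Tuple[str, str], float]       # (s', o) -> probability


def step_risk(b: Belief, violating: Set[str]) -> float:
    """Equation (8): r_b(b) is the belief mass on constraint-violating states."""
    return sum(p for s, p in b.items() if s in violating)


def safe_prior(b: Belief, a: str, T: Transition, violating: Set[str]) -> Belief:
    """Equation (10): prior over s_{k+1}, given that the system is safe at step k."""
    safe_mass = 1.0 - step_risk(b, violating)
    prior: Belief = {}
    for (s, act, s_next), p in T.items():
        if act == a and s in b and s not in violating:
            prior[s_next] = prior.get(s_next, 0.0) + p * b[s] / safe_mass
    return prior


def execution_risk(b: Belief, k: int, h: int, policy: Callable[[Belief], str],
                   T: Transition, O: Observation, observations: List[str],
                   violating: Set[str]) -> float:
    """Equation (12), evaluated recursively down to the horizon h."""
    rb = step_risk(b, violating)
    if k == h or rb >= 1.0:          # terminal node, or no safe mass left
        return rb
    prior = safe_prior(b, policy(b), T, violating)
    expected_child_risk = 0.0
    for o in observations:
        # Equation (11): probability of o given safety at step k.
        pr_sa = sum(O.get((s, o), 0.0) * p for s, p in prior.items())
        if pr_sa == 0.0:
            continue
        # Assumption of this sketch: the child belief is conditioned on Sa_k.
        child = {s: O.get((s, o), 0.0) * p / pr_sa for s, p in prior.items()}
        expected_child_risk += pr_sa * execution_risk(
            child, k + 1, h, policy, T, O, observations, violating)
    return rb + (1.0 - rb) * expected_child_risk


# Hypothetical toy model: "bad" is an absorbing, constraint-violating state.
T = {("ok", "go", "ok"): 0.9, ("ok", "go", "bad"): 0.1, ("bad", "go", "bad"): 1.0}
O = {("ok", "clear"): 0.8, ("ok", "alarm"): 0.2,
     ("bad", "clear"): 0.3, ("bad", "alarm"): 0.7}

risk = execution_risk({"ok": 1.0}, 0, 2, lambda b: "go", T, O,
                      ["clear", "alarm"], {"bad"})
```

For this model the result can be checked by hand: the system stays safe over two steps with probability $0.9^2 = 0.81$, so the execution risk from the initial belief is $0.19$, which is the value the recursion returns up to floating-point error.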

One difficulty associated with (15) is that it depends on the execution risk of all siblings of $b'_{k+1}$, which cannot be computed exactly until terminal nodes are reached. Therefore, one must approximate (15) in order to render it computable during forward search.

We can easily define a necessary condition for feasibility of a chance constraint at a search node by means of an admissible execution risk heuristic $h_{er}(b_{k+1} \mid \pi) \le er(b_{k+1} \mid \pi)$. Combining $h_{er}(\cdot)$ and (15) provides us with a necessary condition

$$er(b'_{k+1} \mid \pi) \le \frac{1}{\mathrm{Pr}^{sa}(o'_{k+1} \mid \pi(b_k), b_k)} \left( \frac{\Delta_k - r_b(b_k)}{1 - r_b(b_k)} - \sum_{o_{k+1} \ne o'_{k+1}} \mathrm{Pr}^{sa}(o_{k+1} \mid \pi(b_k), b_k)\, h_{er}(b_{k+1} \mid \pi) \right). \qquad (16)$$

Since $h_{er}(b_{k+1} \mid \pi)$ computes a lower bound on the execution risk, we conclude that (16) gives an upper bound for the true execution risk bound in (15). The simplest possible heuristic is $h_{er}(b_{k+1} \mid \pi) = 0, \forall b_{k+1}$, which assumes that it is absolutely safe to continue executing policy $\pi$ beyond $b_k$. Moreover, from the non-negativity of the terms in (12), we see that another possible choice of a lower bound is $h_{er}(b_{k+1} \mid \pi) = r_b(b_{k+1})$, which is guaranteed to be an improvement over the previous heuristic, for it incorporates additional information about the risk of failure at that belief state. However, it is still a myopic risk estimate, given that it ignores the execution risk for nodes beyond $b_{k+1}$. All these bounds can be computed forward, starting with $\Delta_0 = \Delta$.
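As a rough illustration of how (16) could be evaluated during forward search, the sketch below computes the risk bound of one child from the parent's bound, the parent's step risk, the safe observation probabilities, and a lower-bound heuristic for each sibling. The function names and the numbers are hypothetical; the pruning test mirrors the comparison of a child's heuristic risk against its bound, and is only one possible way to organize that check.

```python
from typing import Dict


def child_risk_bound(delta_k: float, rb: float, pr_sa: Dict[str, float],
                     h_er: Dict[str, float], o_prime: str) -> float:
    """Right-hand side of (16) for the child reached through observation o_prime.

    delta_k : execution risk bound of the parent node
    rb      : step risk r_b(b_k) of the parent, must be < 1
    pr_sa   : Pr^sa(o | pi(b_k), b_k) for every observation o
    h_er    : admissible (lower-bound) risk estimate of each child, keyed by o
    """
    if rb >= 1.0 or pr_sa[o_prime] == 0.0:
        raise ValueError("bound (16) is undefined for this parent/child pair")
    siblings = sum(pr_sa[o] * h_er[o] for o in pr_sa if o != o_prime)
    return ((delta_k - rb) / (1.0 - rb) - siblings) / pr_sa[o_prime]


def action_is_pruned(delta_k: float, rb: float, pr_sa: Dict[str, float],
                     h_er: Dict[str, float]) -> bool:
    """True if some child's lower-bound risk already exceeds its bound (16)."""
    return any(h_er[o] > child_risk_bound(delta_k, rb, pr_sa, h_er, o)
               for o in pr_sa if pr_sa[o] > 0.0)


# Hypothetical numbers for a parent with r_b = 0 and two children.
pr_sa = {"clear": 0.75, "alarm": 0.25}
zero_heuristic = {"clear": 0.0, "alarm": 0.0}       # h_er = 0
myopic_heuristic = {"clear": 0.04, "alarm": 0.28}   # h_er = r_b of each child

loose = child_risk_bound(0.2, 0.0, pr_sa, zero_heuristic, "clear")
tight = child_risk_bound(0.2, 0.0, pr_sa, myopic_heuristic, "clear")
pruned_at_020 = action_is_pruned(0.2, 0.0, pr_sa, myopic_heuristic)
pruned_at_005 = action_is_pruned(0.05, 0.0, pr_sa, myopic_heuristic)
```

With these numbers the myopic heuristic tightens the bound on the `clear` child from about 0.267 to about 0.173, and the action is pruned when the parent's bound is 0.05 but not when it is 0.2.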
2.3  Enforcing safe behavior at all times

Enforcing (13) bounds the probability of constraint violation over total policy executions, but (15) shows that unlikely policy branches can be allowed risks close or equal to 1 if that will help improve the objective, giving rise to a “daredevil” attitude. Since this might not be the desired risk-aware behavior, a straightforward way of achieving higher levels of safety is to depart from the chance constraints in (13) and, instead, impose a set of chance constraints of the form

$$er(b_k, C^i \mid \pi) \le \Delta^i, \quad \forall i, b_k \ \text{s.t.}\ b_k \ \text{is nonterminal}. \qquad (17)$$


Intuitively, (17) tells the autonomous agent to “remain safe at all times”, whereas the message conveyed by (13) is “stay safe overall”. It should be clear that (17) $\Rightarrow$ (13), so (17) necessarily generates safer policies than (13), but also more conservative ones in terms of utility. Another possibility is to follow (Ono, Kuwata, and Balaram 2012) and impose

$$\sum_{k=0}^{h} r_b(b_k, C^i) \le \Delta^i, \quad \forall i, \qquad (18)$$

which is a sufficient condition for (13) based on Boole’s inequality. One can show that (18) $\Rightarrow$ (17), so enforcing (18) will lead to policies that are at least as conservative as (17).

3  Relation  to  constrained  POMDP’s 

Alternative approaches for chance-constrained POMDP planning have been presented in (Undurti and How 2010) and (Poupart et al. 2015), where the authors propose algorithms for solving constrained POMDP’s (C-POMDP’s). They argue that chance constraints can be modeled within the C-POMDP framework by assigning unit costs to states violating constraints, 0 to others, and proceeding with calculations as usual.

There are two main shortcomings associated with the use of unit costs to deal with chance constraints. First, it only yields correct measures of execution risk in the particular case where constraint violations cause policy execution to terminate. If that is not the case, incorrect probability values can be attained, as shown in the simple example in Figure 1. Second, assuming that constraint violations cause execution to cease has a strong impact on belief state computations. The key insight here is that assuming that constraint violations cause execution to halt provides the system with an invaluable observation: at each non-terminal belief state, the risk $r_b(b_k, C)$ in (8) must be 0. The reason for that is simple: (constraint violation $\Rightarrow$ terminal belief) $\Leftrightarrow$ (non-terminal belief $\Rightarrow$ no constraint violation).

Assuming that policy execution terminates at constraint violations is reasonable when undesirable states are destructive, e.g., the agent is destroyed after crashing against an obstacle. Nevertheless, it is rather limiting in terms of expressiveness, since there are application domains where undesirable states can be “benign”. For instance, in the power supply restoration domain described in the experimental section, connecting faults to generators is undesirable and we want to limit the probability of this event. However, it does not destroy the network. In fact, it might be the only way to significantly reduce the uncertainty about the location of a load fault, therefore allowing for a larger amount of power to be restored to the system.

[Figure 1, two panels: (a) Incorrect execution risks computed using unit costs. (b) Correct execution risks computed according to (12).]

Figure 1: Modeling chance constraints via unit costs may yield incorrect results when constraint-violating states (dashed outline) are not terminal. Numbers within states are constraint violation probabilities. Numbers over arrows are probabilities for a non-deterministic action.
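The first shortcoming can be reproduced on a small example. The sketch below is not the example from Figure 1, whose numbers are not recoverable here; it uses a hypothetical two-state Markov chain in which the violating state `viol` is not terminal, and it enumerates all paths to compare the expected cumulative unit cost with the probability that a path contains at least one violation.

```python
from itertools import product

# Hypothetical chain: from "safe" the system moves to "viol" with probability 0.5;
# "viol" is a non-terminal, constraint-violating state that persists.
P = {"safe": {"safe": 0.5, "viol": 0.5}, "viol": {"viol": 1.0}}
violating = {"viol"}
horizon = 2  # number of transitions after the initial state


def path_probability(path):
    """Probability of a state sequence under the transition table P."""
    prob = 1.0
    for s, s_next in zip(path, path[1:]):
        prob *= P[s].get(s_next, 0.0)
    return prob


expected_unit_cost = 0.0      # expected sum of unit costs along a path
violation_probability = 0.0   # probability that a path violates a constraint

for tail in product(P, repeat=horizon):
    path = ("safe",) + tail
    prob = path_probability(path)
    n_violations = sum(s in violating for s in path)
    expected_unit_cost += prob * n_violations
    violation_probability += prob * (n_violations > 0)
```

Here the expected unit cost is 1.25 while the probability of violating the constraint is 0.75: the unit-cost model counts the same violating path once per visited violating state, so its value is not a probability unless violations end the execution.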

4  Solving  CC-POMDP’s  through  RAO* 

In this section, we introduce the Risk-bounded AO* algorithm (RAO*) for constructing risk-bounded policies for CC-POMDP’s. RAO* is based on heuristic forward search in the space of belief states. The motivation for this is simple: given an initial belief state and limited resources (including time), the number of reachable belief states from a set of initial conditions is usually a very small fraction of the total number of possible belief states. Another reason is that there might not be a clear concept of a “goal state”, which makes it hard to perform goal regression.

Similar to AO* in fully observable domains, RAO* (Algorithm 1) explores its search space of belief states from the initial belief $b_0$ by incrementally constructing a hypergraph $G$ called the explicit hypergraph. Each node in $G$ represents a belief state, and a hyperedge is a compact representation of the process of taking an action and receiving any of a number of possible observations. Each node in $G$ is associated with the Q value

$$Q(b_k, a_k) = \sum_{s_k} R(s_k, a_k)\, b(s_k) + \sum_{o_{k+1}} \Pr(o_{k+1} \mid a_k, b_k)\, Q^*(b_{k+1}), \qquad (19)$$

representing the expected, cumulative reward of taking action $a_k$ at some belief state $b_k$. The first term corresponds to the expected current reward, while the second term is the expected reward obtained by following the optimal deterministic policy $\pi^*$, i.e., $Q^*(b_{k+1}) = Q(b_{k+1}, \pi^*(b_{k+1}))$. Given an admissible estimate $h_Q(b_{k+1})$ of $Q^*(b_{k+1})$, we select actions for the current estimate $\pi$ of $\pi^*$ according to

$$\pi(b_k) = \arg\max_{a_k} \hat{Q}(b_k, a_k), \qquad (20)$$

where $\hat{Q}(b_k, a_k)$ is the same as (19) with $Q^*(b_{k+1})$ replaced by $h_Q(b_{k+1})$. The portion of $G$ corresponding to the current estimate $\pi$ of $\pi^*$ is called the greedy graph, for it uses an admissible heuristic estimate $h_Q(b_k, a_k)$ of $Q^*(b_{k+1})$ to explore the most promising areas of $G$ first.
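The heuristic Q value and the greedy choice in (19) and (20) can be sketched as below. This is an illustrative sketch only: the reward table, the two actions `go` and `stay`, and the constant heuristic `h_q` are hypothetical, and the constant 1.0 is used as an upper bound on the reward that can still be collected with one step remaining in this toy model.

```python
from typing import Callable, Dict, List, Tuple

Belief = Dict[str, float]


def q_hat(b: Belief, a: str, R: Dict[Tuple[str, str], float],
          T: Dict[Tuple[str, str, str], float], O: Dict[Tuple[str, str], float],
          observations: List[str], h_q: Callable[[Belief], float]) -> float:
    """Equation (19) with Q*(b_{k+1}) replaced by the heuristic h_q(b_{k+1})."""
    immediate = sum(R.get((s, a), 0.0) * p for s, p in b.items())
    # Prior belief after taking action a, as in (2).
    prior: Belief = {}
    for (s, act, s_next), p in T.items():
        if act == a and s in b:
            prior[s_next] = prior.get(s_next, 0.0) + p * b[s]
    future = 0.0
    for o in observations:
        eta = sum(O.get((s, o), 0.0) * p for s, p in prior.items())   # (4)
        if eta == 0.0:
            continue
        child = {s: O.get((s, o), 0.0) * p / eta for s, p in prior.items()}  # (3)
        future += eta * h_q(child)
    return immediate + future


def greedy_action(b: Belief, actions: List[str], *model) -> str:
    """Equation (20): pick the action with the largest heuristic Q value."""
    return max(actions, key=lambda a: q_hat(b, a, *model))


# Hypothetical toy model.
R = {("ok", "go"): 1.0, ("ok", "stay"): 0.0}
T = {("ok", "go", "ok"): 0.9, ("ok", "go", "bad"): 0.1, ("bad", "go", "bad"): 1.0,
     ("ok", "stay", "ok"): 1.0, ("bad", "stay", "bad"): 1.0}
O = {("ok", "clear"): 0.8, ("ok", "alarm"): 0.2,
     ("bad", "clear"): 0.3, ("bad", "alarm"): 0.7}
model = (R, T, O, ["clear", "alarm"], lambda child: 1.0)

q_go = q_hat({"ok": 1.0}, "go", *model)
q_stay = q_hat({"ok": 1.0}, "stay", *model)
best = greedy_action({"ok": 1.0}, ["go", "stay"], *model)
```

Because this selection looks only at reward, it would prefer `go` even though `go` carries risk; in RAO* the choice is additionally filtered by the execution risk bounds of Section 2.2.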

The most important differences between AO* and RAO* lie in Algorithms 2 and 3. First, since RAO* deals with partially observable domains, node expansion in Algorithm 2 involves full Bayesian prediction and update steps, as opposed to a simple branching using the state transition function $T$. In addition, RAO* leverages the heuristic estimates of execution risk explained in Section 2.2 in order to perform early pruning of actions that introduce child belief nodes that are guaranteed to violate the chance constraint. The same process is also observed during policy update in Algorithm 3, in which heuristic estimates of the execution risk are used to prevent RAO* from repeatedly choosing actions that are promising in terms of heuristic value, but can be proven to violate the chance constraint at an early stage.

Algorithm 1  RAO*

Input: CC-POMDP H, initial belief b_0.
Output: Optimal policy π mapping beliefs to actions.

1: Explicit graph G and policy π initially consist of b_0.
2: while π has some nonterminal leaf node do
3:     n, G ← expand-policy(G, π)
4:     π ← update-policy(n, G, π)
5: return π.

Algorithm 2  expand-policy

Input: Explicit graph G, policy π.
Output: Expanded explicit graph G', expanded leaf node n.

1: G' ← G, n ← choose-promising-leaf(G, π)
2: for each action a available at n do
3:     ch ← use (2), (3), (4) to expand children of (n, a).
4:     ∀c ∈ ch, use (8), (11), (12), and (19) with admissible heuristics to estimate Q* and er.
5:     ∀c ∈ ch, use (16) to compute exec. risk bounds.
6:     if no c ∈ ch violates its risk bound then
7:         G' ← add hyperedge [(n, a) → ch]
8: if no action added to n then mark n as terminal.
9: return G', n.

The  proofs  of  soundness,  completeness,  and  optimality 
for  RAO*  are  given  in  Lemma  2  and  Theorem  1. 

Lemma  2.  Risk-based  pruning  of  actions  in  Algorithms  2 
(line  6)  and  3  (line  7)  is  sound. 

Proof. The RHS of (15) is the true execution risk bound for
$er(b'_{k+1}|\pi)$. The execution risk bound on the RHS of (16)
is an upper bound for the bound in (15), since we replace
$er(b_{k+1}|\pi)$ for the siblings of $b'_{k+1}$ by admissible
estimates (lower bounds) $h_{er}(b_{k+1}|\pi)$. In the
aforementioned pruning steps, we compare $h_{er}(b'_{k+1}|\pi)$,
a lower bound on the true value $er(b'_{k+1}|\pi)$, to the upper
bound (16). Since $er(b'_{k+1}|\pi) \geq h_{er}(b'_{k+1}|\pi)$
and (16) $\geq$ (15), verifying $h_{er}(b'_{k+1}|\pi) >$ (16) is
sufficient to establish $er(b'_{k+1}|\pi) >$ (15), i.e., action
$a$ currently under consideration is guaranteed to violate the
chance constraint. □

Algorithm 3 update-policy

Input: Expanded node $n$, explicit graph $G$, policy $\pi$.
Output: Updated policy $\pi'$.

1: $Z \leftarrow$ set containing $n$ and its ancestors reachable by $\pi$.
2: while $Z \neq \emptyset$ do
3:   $n \leftarrow$ remove($Z$) node $n$ with no descendant in $Z$.
4:   while there are actions to be chosen at $n$ do
5:     $a \leftarrow$ next best action at $n$ according to (20) satisfying exec. risk bound.
6:     Propagate execution risk bound of $n$ to the children of the hyperedge $(n, a)$
7:     if no child violates its exec. risk bound then
8:       $\pi(n) \leftarrow a$; break
9:   if no action was selected at $n$ then mark $n$ as terminal
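
The pruning test justified by Lemma 2, and used in line 7 of Algorithm 3, can be sketched as follows. Equations (15) and (16) lie outside this section, so the form of the bound below is an assumption: it supposes the execution risk obeys a recursion of the form $er(b_k|\pi) = r_b(b_k) + (1 - r_b(b_k)) \sum_o P(o) \, er(b^o_{k+1}|\pi)$, and the exact expressions in the paper may differ. The function names and the `(prob, h_er)` pair layout are hypothetical.

```python
def child_risk_bounds(delta_k, r_b, children):
    """Upper bound on the execution risk of each child of a hyperedge (n, a).

    delta_k: execution risk bound at the parent belief node
    r_b: probability of violating the constraints at the parent belief (< 1)
    children: list of (prob, h_er) pairs, where prob > 0 is the probability
        of reaching the child and h_er is an admissible (lower-bound)
        estimate of the child's execution risk
    """
    # Risk budget left for the children once the parent's own risk is paid
    # (assumed form of the recursion, see the lead-in).
    budget = (delta_k - r_b) / (1.0 - r_b)
    weighted = [prob * h_er for prob, h_er in children]
    total = sum(weighted)
    # For each child, subtract the siblings' optimistic risk and rescale.
    return [(budget - (total - w)) / prob
            for (prob, _), w in zip(children, weighted)]


def action_is_pruned(delta_k, r_b, children):
    """True if some child's lower bound already exceeds its upper bound."""
    bounds = child_risk_bounds(delta_k, r_b, children)
    return any(h_er > bound for (_, h_er), bound in zip(children, bounds))
```

Because the siblings' risks are replaced by lower bounds, each computed bound is at least as loose as the true one, so an action flagged by `action_is_pruned` could never have satisfied the chance constraint.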

Theorem 1. RAO* is complete and produces the optimal
deterministic, finite-horizon policies meeting the chance
constraints.

Proof. A CC-POMDP, as described in Definition 2, has a
finite number of policy branches, and Lemma 2 shows that
RAO* only prunes policy branches that are guaranteed not
to be part of any chance-constrained solution. Therefore, if
no chance-constrained policy exists, RAO* will eventually
return an empty policy.

The optimality of RAO* with respect to the utility function
follows from the admissibility of $h_Q(b_k, a_k)$ in (20) and
the optimality guarantee of AO*. □

5  Experiments 

This section provides empirical evidence of the usefulness
and general applicability of CC-POMDPs as a modeling tool
for risk-sensitive applications, and shows how RAO* performs
when computing risk-bounded policies in two challenging
domains of practical interest: power supply restoration
(PSR) (Thiebaux and Cordier 2001) and automated planning
for science agents (SA) (Benazera et al. 2005). All models
and RAO* were implemented in Python and run on an Intel
Core i7-2630QM CPU with 8GB of RAM.

In the PSR domain (Thiebaux and Cordier 2001), the objective
is to reconfigure a faulty power network by switching lines
on or off so as to resupply as many customers as possible.
One of the safety constraints is to keep faults isolated at
all times, to avoid endangering people and enlarging the set
of areas left without power. However, fault locations are
hidden, and more information cannot be obtained without
taking the risk of resupplying a fault. Therefore, the chance
constraint is used to limit the probability of connecting
power generators to faulty buses. Our experiments focused on
the semi-rural network from (Thiebaux and Cordier 2001),
which was significantly beyond the reach of (Bonet and
Thiebaux 2003) even for single faults. In our experiments,
there were always circuit breakers at each generator, plus
different numbers of additional circuit breakers depending
on the experiment. Observations correspond to circuit
breakers being open or closed, and actions to opening and
closing switches. The PSR domain is strongly combinatorial,
with $|S| = 261$, $|A| = 68$, and $|O| = 32$.

Our SA domain is based on the planetary rover scenario
described in (Benazera et al. 2005). Starting from some
initial position in a map with obstacles, the science agent may
visit four different sites on the map, each of which could
contain new discoveries with probability based on a prior
belief. If the agent visits a location that contains new
discoveries, it will find them with high probability. The agent’s
position is uncertain, so there is always a non-zero risk of
collision when the agent is traveling between locations. The
agent is required to finish its mission at a relay station, where
it can communicate with an orbiting satellite and transmit
its findings. Since the satellite moves, there is a limited time
window for the agent to gather as much information as
possible and arrive at the relay station. Moreover, we assume the
duration of each traversal to be uncontrollable, but bounded.
In this domain, we use a single chance constraint to ensure
that the event “arrives at the relay location on time” happens
with probability at least $1 - \Delta$. The SA domain has size
$|S| = 6144$, $|A| = 34$, and $|O| = 10$.

We evaluated the performance of RAO* in both domains
under various conditions, and the results are summarized in
Tables 1 (higher utility is better) and 2 (lower cost is
better). It is worth mentioning that constraint violations in
PSR do not cause execution to terminate, and the same is
true for scheduling violations in SA. The only terminal
constraint violations are collisions in SA, and RAO*
makes proper use of this extra bit of information to update
its beliefs. Therefore, PSR and SA are examples of
risk-sensitive domains that can be appropriately modeled as
CC-POMDPs, but not as C-POMDPs with unit costs. The
heuristics used were straightforward: for the execution risk,
we used the admissible heuristic $h_{er}(b_k|\pi) = r_b(b_k)$ in
both domains. For Q values, the heuristic for each state in PSR
consisted of the final penalty incurred if only its faulty nodes
were not resupplied, while in SA it was the sum of the
utilities of all non-visited discoveries.
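
The two simplest heuristics mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: $r_b$ is defined outside this section and is taken here to be the probability mass that a belief places on constraint-violating states, and the names `belief_risk`, `sa_q_heuristic`, `violates`, and `site_utility` are hypothetical.

```python
def belief_risk(belief, violates):
    """Assumed form of r_b: probability mass on constraint-violating states.

    belief: dict state -> probability
    violates: predicate returning True for states that violate a constraint
    """
    return sum(p for state, p in belief.items() if violates(state))


def sa_q_heuristic(site_utility, visited):
    """Optimistic utility-to-go in SA: every unvisited site pays off in full.

    site_utility: dict site -> utility of the discovery at that site
    visited: set of sites already visited
    """
    return sum(u for site, u in site_utility.items() if site not in visited)
```

Under the assumed definition, `belief_risk` never overestimates the execution risk, because risk accumulated at later steps can only add to the risk already present at the current belief. `sa_q_heuristic` never underestimates the attainable utility, because no policy can collect more than every remaining discovery.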

As expected, both tables show that increasing the maximum
amount of risk $\Delta$ allowed during execution can only
improve the policy’s objective. The improvement is not
strictly monotonic, though: the impact of the chance constraint
on the objective is discontinuous in $\Delta$ when only
deterministic policies are considered, since one cannot randomly
select between two actions in order to achieve a continuous
interpolation between risk levels. Being able to compute
increasingly better approximations of a policy’s execution risk,
combined with forward propagation of risk bounds, also
allows RAO* to converge faster by quickly pruning candidate
policies that are guaranteed to violate the chance constraint.
This can be clearly observed in Table 2 when we move from
$\Delta = 0.5$ to $\Delta = 1.0$ (no chance constraint).
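
The discontinuity described above can be illustrated with a toy example. The candidate policies and their (execution risk, utility) pairs below are hypothetical numbers chosen for illustration, not results from the experiments, and `best_feasible_utility` is an assumed helper name.

```python
def best_feasible_utility(policies, delta):
    """Best utility among deterministic policies whose risk is at most delta.

    policies: list of (execution_risk, utility) pairs
    Returns None if no policy satisfies the chance constraint.
    """
    feasible = [utility for risk, utility in policies if risk <= delta]
    return max(feasible) if feasible else None


# Hypothetical candidates: (execution risk, utility).
candidates = [(0.00, 0.0), (0.02, 20.0), (0.04, 30.0)]
for delta in (0.00, 0.01, 0.02, 0.03, 0.04, 0.05):
    print(delta, best_feasible_utility(candidates, delta))
```

With these numbers the attainable utility stays flat between thresholds and jumps at $\Delta = 0.02$ and $\Delta = 0.04$. Without randomization there is no policy whose risk and utility lie between those of two deterministic candidates.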

Another important aspect is the impact of sensor
information on the performance of RAO*. Adding more sources
of sensing information increases the branching in the search
hypergraph used by RAO*, so one could expect performance
to degrade. However, that is not necessarily the case, as
shown by the left and right numbers in the cells of Table
2. By adding more sensors to the power network, RAO*
can more quickly reduce the size of its belief states,
therefore leading to a reduced number of states evaluated during
search. Another benefit of reduced belief states is that RAO*
can more effectively reroute energy in the network within
the given risk bound, leading to lower execution costs.

Finally, we wanted to investigate how well a C-POMDP
approach would perform in these domains relative to a
CC-POMDP. Following the literature, we made the additional
assumption that execution halts at all constraint violations,
and assigned unit terminal costs to those search nodes.
Results on two example instances of the PSR and SA domains were
the following: I) in SA, C-POMDP and CC-POMDP both
attained a utility of 29.454; II) in PSR, C-POMDP reached a
final cost of 53.330, while CC-POMDP attained 36.509. The
chance constraints were always identical for C-POMDP and
CC-POMDP. First, one should notice that both models had
the same performance in the SA domain, which is in
agreement with the claim that they coincide in the particular case
where all constraint violations are terminal. The same,
however, clearly does not hold in the PSR domain, where the
C-POMDP model had significantly worse performance than its
corresponding CC-POMDP with the exact same parameters.
Assuming that constraint violations are terminal in order to
model them as costs greatly restricts the space of potential
solution policies in domains with non-destructive constraint
violations, leading to conservatism. A CC-POMDP
formulation, on the other hand, can potentially attain significantly
better performance while offering the same safety guarantee.


Window[s]   $\Delta$   Time[s]   Nodes   States   Utility
20          0.05          1.30       1       32     0.000
30          0.01          1.32       1       32     0.000
30          0.05         49.35      83      578    29.168
40          0.002         9.92      15      164    21.958
40          0.01         44.86      75      551    29.433
40          0.05         38.79      65      443    29.433
100         0.002        95.23     127     1220    24.970
100         0.01        184.80     161     1247    29.454
100         0.05        174.90     151     1151    29.454

Table 1: SA results for various time windows and risk levels.


#   $\Delta$   Time[s]        Nodes        States         Cost
1   0          0.025/0.013    1.57/1.29    5.86/2.71      45.0/30.0
1   0.5        0.059/0.014    3.43/1.29    10.71/2.71     44.18/30.0
1   1          2.256/0.165    69.3/11.14   260.4/23.43    30.54/22.89
2   0          0.078/0.043    2.0/1.67     18.0/8.3       84.0/63.0
2   0.5        0.157/0.014    3.0/1.29     27.0/2.71      84.0/30.0
2   1          32.78/0.28     248.7/5.67   1340/32.33     77.12/57.03
3   0          1.122/0.093    7.0/2.0      189.0/12.0     126.0/94.50
3   0.5        0.613/0.26     4.5/4.5      121.5/34.5     126.0/94.50
3   1          123.9/51.36    481.5/480    8590.5/2648    117.6/80.89

Table 2: PSR results for various numbers of faults (#) and
risk levels. Top: avg. of 7 single faults. Middle: avg. of 3
double faults. Bottom: avg. of 2 triple faults. Left (right)
numbers correspond to 12 (16) network sensors.


6  Conclusions 

We have presented RAO*, an algorithm for optimally solving
CC-POMDPs. By combining the advantages of AO*
in the belief space with forward propagation of risk upper
bounds, RAO* is able to solve challenging risk-sensitive
planning problems of practical interest and size. Our agenda
for future work includes generalizing the algorithm to move
away from the finite-horizon setting, as well as handling more
general chance constraints, including temporal logic path
constraints (Teichteil-Konigsbuch 2012).